CHI-SQUARE TESTS*


David  S.  Moore 
Purdue  University 

1.  INTRODUCTION 

Statistics is the science of collecting, describing and interpreting
data. The most common mathematical model underlying statistical
interpretation of data assumes that the values of measured variables in the
population of interest are described by a probability distribution. If
several variables are measured (say the length and weight of a cockroach),
the population is described by a multivariate probability distribution.

When  the  forces  of  good  prevail,  the  fortunate  statistician  has  data 
consisting  of  observed  values  of  independent  random  variables,  each  having 
the  population  probability  distribution.  The  statistical  design  of  sampling 
and  experimentation  is  intended  to  produce  this  happy  state  of  affairs  or 
some  moderate  complication  of  it.  We  will  assume  that,  whether  by  design 
or  (this  is  risky)  by  good  fortune,  the  data  collection  process  yields 
independent random variables $X_1,\dots,X_n$ having a common probability
distribution. This distribution is unknown - that's the distinction between
statistics and probability theory. Let $F$ denote the unknown distribution
function (df) of any single $X_i$.

It  is  clear  that  a classical  statistical  problem  is  "Which  probability 
models  adequately  describe  the  data?"  This  question  can  be  asked  for 
descriptive  purposes  or  as  a preliminary  to  formal  inference  from  the  data. 

*Preparation of this paper was supported by the Air Force Office of
Scientific Research under Grant AFOSR-72-2350.

Particularly in the latter case, the statistician may have in mind a
specific family of probability distributions (such as the normal family)
and the more exact question "Do the data support or impugn the hypothesis
that the population distribution is a member of this family?" Most common
families of distributions have df's of specified functional form indexed
by a (real or vector) parameter. For example, an individual member of the
univariate normal family of distributions is specified by the values of the
mean $\mu$ and standard deviation $\sigma$. If $G(\cdot \mid \theta)$ is a family of df's
indexed by a parameter $\theta$ running over a parameter space $\Omega$, we have now
formulated the following problem.

Given independent random variables $X_1,\dots,X_n$ having common unknown
df $F$, test the hypothesis

$H_0$: $F(\cdot) = G(\cdot \mid \theta)$ for some $\theta$ in $\Omega$.

This is the problem of goodness of fit. Notice that in practice the
observations will often be multivariate, and that the null hypothesis
will usually be composite (that is, the family $G(\cdot \mid \theta)$ will contain more
than one member). Notice also that although we have stated the problem in
terms of hypothesis testing, it will rarely be sensible to simply accept
or reject at the usual significance levels such as $\alpha = .05$. In particular,
if we test fit to (say) the univariate normal family as a preliminary to a
further analysis which assumes normality, we should surely not cling to the
assumption of normality until the evidence against it is significant at the
five percent level. Many applied statisticians favor using an $\alpha$ of .20 or
.25 for such preliminary tests. The real difficulty is that the $H_0$ in the
problem of fit does not have the status ("The statement we hope to find
evidence against") ascribed to null hypotheses in standard tests of
significance. Nonetheless, the attained significance level of a test of
fit is at least a descriptive measure of the distance of the data
from the hypothesized family of distributions. We will therefore study the
theory of some tests of fit without further ventures into the wilderness of
philosophies of inference.


The  oldest  family  of  tests  of  fit  was  fathered  by  Karl  Pearson  in 
1900.  During  the  preceding  decade,  Pearson  had  developed  families  of 
probability distributions in the course of his work on Mathematical
Contributions to the Theory of Evolution. He now wished to see which of
these  fit  his  data,  rather  than  simply  assuming  that  all  biological 
variables  are  normally  distributed.  Statistics  as  a discipline  was  in  its 
infancy  in  1900.  Many  results  and  methods  which  would  form  part  of  the 
new  science  were  scattered  through  the  work  of  Gauss,  Laplace,  Lagrange  and 
others,  but  these  results  were  not  collected  and  unified,  and  were  often 
unknown  to  statisticians  such  as  Pearson.  The  binomial  distributions  and 
their  approximation  by  normal  distributions  were  well  known;  the  chi-square 
distributions were known as the distributions of sums of squares of
independent normal random variables; and the multivariate normal distributions
had only recently become familiar. These last distributions will play a
major role in our study. Pearson knew the $p$-variate normal distribution
with mean vector $\mu$ and nonsingular covariance matrix $\Sigma$ as the distribution
having density function of the form

(1)    $f(y) = c\, e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}$.

Here $y' = (y_1,\dots,y_p)$ is the $p$-variate argument of the density function.
If $Y = (Y_1,\dots,Y_p)'$ is a random variable having this distribution, we will
write $Y \sim N_p(\mu,\Sigma)$ to express this fact.


Pearson sought first to test the simple null hypothesis that univariate
observations $X_1,\dots,X_n$ have a given df $G$. He partitioned the line into cells
$E_1,\dots,E_M$ and based his test on the frequencies $N_1,\dots,N_M$ of observations
falling in these cells. If the hypothesis is true and

$p_i = P_G[X_1 \text{ in } E_i] = \int_{E_i} dG(x)$,

then $np_i$ is the expected frequency for $E_i$ and the quantities $N_i - np_i$ measure
the lack of fit between data and model. Pearson then argued:

(a) If $Y \sim N_p(0,\Sigma)$, then the quadratic form which appears in the exponent
of the density function (1) has the same distribution as the sum of squares
of $p$ independent standard normal random variables. This is the chi-square
distribution with $p$ degrees of freedom, and we write $Y'\Sigma^{-1}Y \sim \chi^2(p)$.

(b) By the De Moivre-Laplace normal approximation to the binomial distributions,
each $N_i - np_i$ is approximately normal when the number of observations $n$ is
large.

(c) Computing variances and covariances for
$Y = (N_1 - np_1,\dots,N_{M-1} - np_{M-1})'$
and inverting the covariance matrix shows that

$Y'\Sigma^{-1}Y = \sum_{i=1}^{M} \frac{(N_i - np_i)^2}{np_i}$,

and this statistic therefore has approximately the $\chi^2(M-1)$ distribution
when the null hypothesis is true and $n$ is large. Large values of this
statistic (i.e., values in the upper tail of the $\chi^2(M-1)$ distribution) are
evidence of lack of fit.
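
As a small numerical illustration of the statistic in (c), the following
Python sketch computes Pearson's sum for a simple null hypothesis. The
function name pearson_statistic and the example counts and cell
probabilities are hypothetical, chosen only for illustration.

```python
def pearson_statistic(counts, probs):
    """Return the sum over cells of (N_i - n p_i)^2 / (n p_i)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    return sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, probs))


# Hypothetical data: n = 100 observations falling in M = 3 cells.
counts = [18, 55, 27]
probs = [0.25, 0.50, 0.25]  # cell probabilities under the simple null hypothesis

statistic = pearson_statistic(counts, probs)
degrees_of_freedom = len(counts) - 1  # M - 1 when no parameters are estimated
print(statistic, degrees_of_freedom)
```

The computed value would then be referred to the upper tail of the
chi-square distribution with M-1 degrees of freedom.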

This argument contains some minor mathematical gaps: it ignores the
distinction between approximate normality of each $N_i - np_i$ and approximate
multivariate normality of the vector $Y$, and does not show how (a) implies
that when $Y$ is approximately $N_{M-1}(0,\Sigma)$, then $Y'\Sigma^{-1}Y$ is approximately
$\chi^2(M-1)$. But the argument is in the best spirit of pre-Weierstrass mathematics,
needing only a few technicalities to become rigorous. Pearson's proof
shows, for example, why his famous chi-square statistic does not have
the variance $np_i(1-p_i)$ of $N_i$ in the denominator of the $i$th summand. More
important, the idea of reducing the general problem of fit to a combinatorial
setting (counting numbers of observations in each of $M$ cells) was of
lasting significance. Chi-square tests remain among the most common tests
of fit, largely because of the flexibility of Pearson's idea. If, for
example, the observations and the cells are multidimensional, the
distribution of the cell frequencies and the form and theory of the
Pearson chi-square statistic are unchanged.

Now of course the null hypothesis in a problem of fit is generally
that the df of the observations falls in a family
$\{G(\cdot \mid \theta): \theta \text{ in } \Omega\}$ of df's. In this case, the cell probabilities
depend on the unknown parameter $\theta$,

$p_i(\theta) = \int_{E_i} dG(x \mid \theta)$.

Pearson proposed to estimate the unknown parameter from the data by some
function $\theta_n = \theta_n(X_1,\dots,X_n)$. In testing fit to the univariate normal
family, for example, the parameter is $\theta = (\mu,\sigma)$ and the population mean $\mu$
and standard deviation $\sigma$ are commonly estimated by the mean and standard
deviation of the sample. The Pearson statistic now becomes

(2)    $\sum_{i=1}^{M} \frac{[N_i - np_i(\theta_n)]^2}{np_i(\theta_n)}$.

That is, we test the fit of the data to the df $G(\cdot \mid \theta_n)$ determined by the
estimated parameter value.


Unfortunately, as mathematicians learned before statisticians,
pre-Weierstrass mathematics has its limitations. Even when the estimator
$\theta_n$ approaches the true value of $\theta$ as the sample size $n$ increases, the
statistic (2) does not then have approximately the $\chi^2(M-1)$ distribution,
as Pearson believed it did. Since statistical methods are actually used in
the real world, observant users began to suspect that something was amiss.
Some even did extensive simulations (quite a chore in those pre-computer
days) to compare Pearson's theoretical distribution with the observed
distribution of his statistic. It was not until 1924 that R. A. Fisher
showed that the statistic (2) does not have approximately the $\chi^2(M-1)$
distribution in large samples, and that the distribution it does have depends
on how the unknown parameter is estimated. If $\theta_n$ is the value of $\theta$ which
minimizes the statistic (2) for the given cell frequencies (so that
$G(\cdot \mid \theta_n)$ is the closest df in the hypothesized family to the data by this
measure of distance), Fisher showed that the approximate distribution is
$\chi^2(M-m-1)$ when $\theta$ is $m$-dimensional. When
other methods of estimation are used, this conclusion is false. It is false,
for example, when the sample mean and standard deviation are used in testing
fit to the univariate normal family. Only since the 1950's has a rigorous study
of chi-square statistics with general estimators $\theta_n$ been made, and solutions
obtained to many practical and mathematical problems concerning these
statistics.
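
To make the minimum chi-square idea concrete, here is a minimal Python
sketch under loudly stated assumptions: the one-parameter family of cell
probabilities in cell_probs (three cells with probabilities
theta^2, 2 theta (1 - theta), (1 - theta)^2) is hypothetical and does not come
from the text, and the minimization is a crude grid search rather than any
method Fisher or Pearson used. It only illustrates choosing the parameter
value that minimizes statistic (2) and then using M-m-1 degrees of freedom.

```python
def cell_probs(theta):
    """Hypothetical one-parameter family of cell probabilities (M = 3, m = 1)."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]


def pearson_statistic(counts, probs):
    """Sum over cells of (N_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, probs))


def minimum_chi_square(counts, grid_size=9999):
    """Grid search over theta in (0, 1) for the minimizer of statistic (2)."""
    best_theta, best_stat = None, float("inf")
    for k in range(1, grid_size + 1):
        theta = k / (grid_size + 1)
        stat = pearson_statistic(counts, cell_probs(theta))
        if stat < best_stat:
            best_theta, best_stat = theta, stat
    return best_theta, best_stat


counts = [12, 40, 48]  # hypothetical cell frequencies, n = 100
theta_min, stat_min = minimum_chi_square(counts)
degrees_of_freedom = len(counts) - 1 - 1  # M - m - 1 with m = 1
print(theta_min, stat_min, degrees_of_freedom)
```

A grid search is used only because it is transparent; any one-dimensional
minimizer would serve the same purpose.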


Our study of the modern theory of chi-square tests of fit will touch
on several other topics in statistics which are important in their own
right. We must first acquaint ourselves with the multivariate normal
family of distributions. And since chi-square tests are large-sample tests,
based on the limiting multivariate normal distribution of the cell frequencies,
some of the basic techniques of statistical large sample theory must be
mentioned. Finally, Pearson's construction of the proper quadratic form
in the cell frequencies is the genesis of some familiar statistical
procedures other than tests of fit. In all of this, we hope not to entirely
lose sight of the interplay between theory and practice which gives
statistics its vitality.




2.  THE  MULTIVARIATE  NORMAL  DISTRIBUTIONS 

The multivariate normal family plays a role in the study of
multidimensional data analogous to that played by the univariate normal family
in one dimension. These distributions are not only important probability
models in their own right, but because of the central limit theorem serve
as large sample approximations to other models. We will not define these
distributions by the density function (1) for two reasons. First, that
definition is awkward due to the complexity of the density function. More
important, a distribution in Euclidean $p$-space $R^p$ does not have a density
function in $R^p$ if it assigns probability one to a set of measure zero. Such
singular distributions - other than the discrete distributions supported on
a countable set - are somewhat pathological in one dimension, and play
little role in probability modeling there. But in higher dimensions it is
quite common for random variables to be dependent in such a way that with
probability 1 their values fall in a lower-dimensional hyperplane and their
joint distribution is thus singular. The cell frequencies $N_1,\dots,N_M$ in a
chi-square test, for example, satisfy $\sum_{i=1}^{M} N_i = n$ and so take values only
in this $(M-1)$-dimensional hyperplane. In Section 1 we followed Pearson in
working with the nonsingular distribution of $(N_1,\dots,N_{M-1})'$, but this mode
of escape is more awkward in other settings.


To statisticians, the most useful definition of a probability
distribution is often a representational definition - a statement of
what random variable has the distribution. For example, the $\chi^2(p)$
distribution is that of $\sum_{i=1}^{p} Z_i^2$ where $Z_1,\dots,Z_p$ are independent $N(0,1)$
random variables. In this spirit, the $N_p(\mu,\Sigma)$ distribution is defined as
the distribution of the random variable

(3)    $Y = AZ + \mu$

where $Z = (Z_1,\dots,Z_m)'$ and the $Z_i \sim N(0,1)$ are independent,
$\mu = (\mu_1,\dots,\mu_p)'$ is the vector of means, and $A$ is any $p \times m$ matrix
satisfying $AA' = \Sigma$. That is,
the multivariate normal distributions are the distributions of affine
transformations of a set of independent standard normal random variables.
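
The representational definition (3) is also a recipe for simulation. The
following Python sketch draws from a bivariate normal distribution by
applying an affine map to independent standard normal variables. The
matrix A, the mean vector mu, the seed and the helper names are all
hypothetical choices made for illustration.

```python
import random


def matvec(A, z):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]


def a_a_transpose(A):
    """Return AA', the covariance matrix implied by the representation (3)."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in A] for r1 in A]


def sample_mvn(A, mu, rng):
    """Draw one value of Y = AZ + mu with Z a vector of independent N(0,1)."""
    m = len(A[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return [v + c for v, c in zip(matvec(A, z), mu)]


A = [[1.0, 0.0], [1.0, 1.0]]  # hypothetical 2 x 2 matrix, AA' = [[1, 1], [1, 2]]
mu = [2.0, -1.0]              # hypothetical mean vector
rng = random.Random(0)        # fixed seed so the sketch is reproducible

draws = [sample_mvn(A, mu, rng) for _ in range(20000)]
sample_mean = [sum(d[i] for d in draws) / len(draws) for i in range(2)]
print(a_a_transpose(A), sample_mean)
```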

It is easy to check that $\mu$ and $AA' = \Sigma$ are in fact the mean vector
and covariance matrix of the $p$-variate random vector defined in (3). To
justify this definition of $N_p(\mu,\Sigma)$, we must show that $Y$ so defined has the
same distribution for any $m$ and any $p \times m$ matrix $A$ satisfying $AA' = \Sigma$. To
justify the notation $N_p(\mu,\Sigma)$, we must show that this family is parameterized
by $(\mu,\Sigma)$ alone. Both of these facts follow from a computation of the
characteristic function (Fourier transform) of the distribution of $Y$. This
computation also illustrates the convenience of the representational
definition.

The characteristic function of $Y$ is the function of $p$ real variables
$t' = (t_1,\dots,t_p)$ defined by

$\varphi_Y(t') = E[e^{it'Y}]$
$\qquad = E[e^{i(t'AZ + t'\mu)}]$
$\qquad = e^{it'\mu} E[e^{it'AZ}]$.

Now since the characteristic function of any $Z_j \sim N(0,1)$ is easily computed
to be

$E[e^{isZ_j}] = e^{-\frac{1}{2}s^2}$,    $-\infty < s < \infty$,

and $Z_1,\dots,Z_m$ are independent,

$\varphi_Y(t') = e^{it'\mu} E[e^{i\{(t'A)_1 Z_1 + \dots + (t'A)_m Z_m\}}]$
$\qquad = e^{it'\mu} \prod_{j=1}^{m} E[e^{i(t'A)_j Z_j}]$
$\qquad = e^{it'\mu} \prod_{j=1}^{m} e^{-\frac{1}{2}(t'A)_j^2}$
$\qquad = e^{it'\mu - \frac{1}{2}t'AA't} = e^{it'\mu - \frac{1}{2}t'\Sigma t}$.

This characteristic function is the same for all $m$ and $A$ with $AA' = \Sigma$, and
is parameterized by $\mu$ and $\Sigma$. Since the characteristic function uniquely
determines a probability distribution, $N_p(\mu,\Sigma)$ is well defined.
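
The point that only AA' matters can be checked numerically. In the
Python sketch below, two hypothetical matrices of different shapes
(2 x 2 and 2 x 3) have the same product AA', and the closed form of the
characteristic function, exp(i t'mu - t'AA't / 2), is evaluated for each.
The matrices, the vector t and the helper names are assumptions made for
illustration only.

```python
import cmath


def gram(A):
    """Return AA' for a matrix A stored as a list of rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in A] for r1 in A]


def char_fn(t, mu, A):
    """Evaluate exp(i t'mu - t'AA't / 2) for the representation Y = AZ + mu."""
    sigma = gram(A)
    p = len(t)
    t_mu = sum(t[i] * mu[i] for i in range(p))
    quad = sum(t[i] * sigma[i][j] * t[j] for i in range(p) for j in range(p))
    return cmath.exp(1j * t_mu - 0.5 * quad)


mu = [2.0, -1.0]
A_square = [[1.0, 0.0], [1.0, 1.0]]          # p = 2, m = 2
A_wide = [[1.0, 0.0, 0.0], [1.0, 0.6, 0.8]]  # p = 2, m = 3, same AA'
t = [0.3, -0.7]

phi_square = char_fn(t, mu, A_square)
phi_wide = char_fn(t, mu, A_wide)
print(phi_square, phi_wide)
```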

The definition (3) is geometrically transparent. The distribution of
$Z$ is nonsingular in $R^m$ - indeed, has all of $R^m$ as its support. If the linear
transformation $A: R^m \to R^p$ has full rank $p$, then the distribution $N_p(\mu,\Sigma)$
of $Y$ is nonsingular and has $R^p$ as its support. If the rank of $A$ is $r < p$,
then $N_p(\mu,\Sigma)$ is singular and is supported on the $r$-dimensional hyperplane
in $R^p$ obtained by translating the range of $A$ by $\mu$. Since the range of
$AA' = \Sigma$ is the same as the range of $A$, $N_p(\mu,\Sigma)$ is a nonsingular distribution
if and only if the covariance matrix $\Sigma$ is nonsingular. The support of
$N_p(\mu,\Sigma)$ is a hyperplane in $R^p$ with dimension equal to the rank of $\Sigma$.

Many properties of the multivariate normal distributions follow easily
from (3). For example, if $B$ is any $s \times p$ matrix, then applying $B$ to both
sides of (3) shows at once that $BY \sim N_s(B\mu, B\Sigma B')$. That is, any set of linear
combinations of jointly normal random variables has again a multivariate
normal joint distribution. In particular, any single linear combination is
univariate normal, and this includes each individual component $Y_i$ of $Y$.
That the random variable $Y$ has density function of the form (1) when $\Sigma$ is
nonsingular can be deduced by a change of variables in the known density
function of $Z$.

The joint distribution of a set of independent univariate normal random
variables is a multivariate normal distribution with a diagonal covariance
matrix $\Sigma$. (This holds because $N(\mu_i,\sigma_i^2)$ is the distribution of
$Y_i = \sigma_i Z_i + \mu_i$ for $Z_i \sim N(0,1)$, so that (3) fits $Y = (Y_1,\dots,Y_p)'$.)
Since $N_p(\mu,\Sigma)$ is
determined by $\mu$ and $\Sigma$, it follows that random variables having a multivariate
normal joint distribution are independent if and only if they are uncorrelated.
Independence in a multivariate normal setting can therefore be established
by simply computing covariances. If $I_p$ denotes the $p \times p$ identity matrix,
$N_p(0,I_p)$ now denotes the distribution of a set of $p$ independent $N(0,1)$
random variables. If $Z \sim N_p(0,I_p)$ and $P$ is a $p \times p$ orthogonal matrix, it
follows that $PZ \sim N_p(0,I_p)$ once again.

For our study of chi-square tests, we are particularly interested in
quadratic forms in multivariate normal random variables. The representational
approach reduces this to the study of quadratic forms in independent $N(0,1)$
random variables. We know one fact about such forms: if $Z \sim N_p(0,I_p)$ then
the sum of squares $Z'Z = \sum_{i=1}^{p} Z_i^2$ has the $\chi^2(p)$ distribution. Two potential
generalizations of this fact come to mind at once. When $Y \sim N_p(0,\Sigma)$ we can
ask (1) What quadratic forms $Y'CY$ have chi-square distributions? (2) What
is the distribution of the sum of squares $Y'Y$? Since partial sums of
squares in the $N_p(0,I_p)$ case have $\chi^2(k)$ distributions for $k < p$, the first
question might be specialized to: What quadratic forms $Y'CY$ have the $\chi^2(r)$
distribution for $r$ as large as possible?
It is convenient for the study of quadratic forms to use a particular
representation of $Y \sim N_p(0,\Sigma)$ based on the fact that any nonnegative
definite symmetric matrix (such as an arbitrary covariance matrix $\Sigma$) has
a unique nonnegative definite symmetric square root. To see this, note
first that any square root $\Sigma^{1/2}$ commutes with its square $\Sigma$, so that if
$\Sigma^{1/2}$ is symmetric, $\Sigma$ and $\Sigma^{1/2}$ can be simultaneously diagonalized by some
orthogonal matrix $P$. From

(4)    $P\Sigma P' = \mathrm{diag}(\sigma_1,\dots,\sigma_p) = D$,    $P\Sigma^{1/2}P' = \mathrm{diag}(\gamma_1,\dots,\gamma_p)$

and $(\Sigma^{1/2})^2 = \Sigma$ it follows that $\gamma_i = \sigma_i^{1/2}$ and hence that all symmetric
nonnegative definite square roots have the form

$\Sigma^{1/2} = P'\,\mathrm{diag}(\sigma_1^{1/2},\dots,\sigma_p^{1/2})\,P$

for $P$ and $D$ satisfying (4) and nonnegative square roots $\sigma_i^{1/2}$. This also
shows that such a $\Sigma^{1/2}$ exists. The $P$ and $D$ in (4) are not unique, but are
determined up to permutations of the $\sigma_i$ and of the corresponding rows of
$P$. It is easy to see that $\Sigma^{1/2}$ is unchanged by such permutations and is
hence unique.
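
For a concrete feel for the symmetric square root, the Python sketch below
computes it for a 2 x 2 matrix. Note the swapped-in technique: rather than
the diagonalization in (4), it uses a closed-form identity valid for 2 x 2
nonnegative definite symmetric matrices (a consequence of the
Cayley-Hamilton theorem), namely that the root is (S + s I)/t with
s = sqrt(det S) and t = sqrt(tr S + 2s). The function names and the
example matrix are hypothetical.

```python
import math


def sqrt_2x2(S):
    """Nonnegative definite symmetric square root of a 2x2 such matrix S."""
    (a, b), (c, d) = S
    if abs(b - c) > 1e-12:
        raise ValueError("S must be symmetric")
    det = max(a * d - b * c, 0.0)  # clamp tiny negative rounding error
    s = math.sqrt(det)
    t_squared = a + d + 2.0 * s
    if t_squared <= 0.0:           # only the zero matrix reaches this branch
        return [[0.0, 0.0], [0.0, 0.0]]
    t = math.sqrt(t_squared)
    return [[(a + s) / t, b / t], [c / t, (d + s) / t]]


def matmul_2x2(A, B):
    """Product of two 2x2 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


S = [[2.0, 1.0], [1.0, 2.0]]     # hypothetical covariance matrix
root = sqrt_2x2(S)
print(root, matmul_2x2(root, root))  # the second matrix should reproduce S
```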


We can thus represent $Y \sim N_p(0,\Sigma)$ as $Y = \Sigma^{1/2}Z$ with $Z \sim N_p(0,I_p)$. The
distribution of quadratic forms in $Y$ is completely described by the
following result.


THEOREM 1. Suppose that $Y \sim N_p(0,\Sigma)$ and that $C$ is any $p \times p$ symmetric
matrix. Then the quadratic form $Y'CY$ has the distribution of $\sum_{i=1}^{p} \lambda_i Z_i^2$,
where the $Z_i$ are independent $N(0,1)$ random variables and the $\lambda_i$ are the
characteristic roots of $\Sigma^{1/2}C\Sigma^{1/2}$.

PROOF. Since $Y = \Sigma^{1/2}Z$,

$Y'CY = Z'\Sigma^{1/2}C\Sigma^{1/2}Z = Z'QZ$.

Because $Q = \Sigma^{1/2}C\Sigma^{1/2}$ is symmetric, there is an orthogonal matrix $P$ such that
$PQP' = D$, where $D$ is diagonal with the $\lambda_i$ as diagonal elements. So $Q = P'DP$
and

$Y'CY = Z'QZ = (PZ)'D(PZ)$.

The right side above is $\sum_{i=1}^{p} \lambda_i (Z_i^*)^2$ where $Z_i^*$ is the $i$th component of $PZ$.
Since $PZ \sim N_p(0,I_p)$, this is a representation of the desired form.

P P 

When $X \sim N_p(\mu,\Sigma)$ and $\Sigma$ is nonsingular, $Y = X - \mu \sim N_p(0,\Sigma)$ and we
obtain as a corollary of Theorem 1 Pearson's result that the quadratic
form $(X-\mu)'\Sigma^{-1}(X-\mu)$ appearing in the exponent of the density function has
the distribution of $\sum_{i=1}^{p} Z_i^2$, which is $\chi^2(p)$. (Take $C = \Sigma^{-1}$, so that
$\Sigma^{1/2}C\Sigma^{1/2} = I_p$ and every $\lambda_i = 1$.) This answers our first question
when $\Sigma$ is nonsingular. The second question, concerning the
distribution of the sum of squares $Y'Y$, is answered by setting $C = I_p$ in
Theorem 1. The distribution is that of $\sum_{i=1}^{p} \lambda_i Z_i^2$, where the $\lambda_i$ are now the
characteristic roots of $\Sigma$ itself.
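
The weights in Theorem 1 are easy to compute in small cases. The Python
sketch below handles the 2 x 2 case under one stated shortcut: because
the products AB and BA of square matrices have the same characteristic
polynomial, the characteristic roots of the symmetrized product in
Theorem 1 coincide with those of the plain product C Sigma, which avoids
forming a square root. The function names and example matrices are
hypothetical.

```python
import math


def matmul_2x2(A, B):
    """Product of two 2x2 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def characteristic_roots_2x2(M):
    """Real roots of x^2 - tr(M) x + det(M) = 0, in increasing order."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = max(tr * tr - 4.0 * det, 0.0)  # clamp tiny negative rounding error
    r = math.sqrt(disc)
    return [(tr - r) / 2.0, (tr + r) / 2.0]


def quadratic_form_weights(C, sigma):
    """Weights lambda_i such that Y'CY is distributed as sum lambda_i Z_i^2."""
    return characteristic_roots_2x2(matmul_2x2(C, sigma))


sigma = [[2.0, 1.0], [1.0, 2.0]]               # hypothetical covariance matrix
identity = [[1.0, 0.0], [0.0, 1.0]]
sigma_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # inverse of sigma

print(quadratic_form_weights(identity, sigma))   # weights for Y'Y
print(quadratic_form_weights(sigma_inv, sigma))  # weights for Pearson's form
```

With C equal to the identity the weights are the characteristic roots of
the covariance matrix; with C equal to its inverse both weights are 1,
which is the chi-square case.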

We have yet to answer the first question fully by extending Pearson's
recipe to the singular case. Since the rank of $\Sigma^{1/2}C\Sigma^{1/2}$ cannot exceed the
rank (say $r$) of $\Sigma$ and $\Sigma^{1/2}$, it is clear that if $Y'CY \sim \chi^2(k)$, then $k \le r$.
Theorem 1 implies that $Y'CY \sim \chi^2(k)$ if and only if $\Sigma^{1/2}C\Sigma^{1/2}$ is idempotent of
rank $k$, and
a little matrix manipulation shows that a sufficient condition for this is
that $\Sigma C$ be idempotent of rank $k$. This general result is not very helpful
in the search for $C$ such that $Y'CY \sim \chi^2(r)$. Pearson's result involved the
inverse $\Sigma^{-1}$. It turns out that an "inverse" for singular matrices neatly
generalizes his recipe.

For an arbitrary $n \times m$ real matrix $A$, a generalized inverse of $A$ can
be defined by any of the following statements.

(a) A generalized inverse of $A$ is any $m \times n$ matrix $G$ such that
$x = Gy$ solves the equations $y = Ax$ for any $y$ in the range
(column space) of $A$.

(b) A generalized inverse of $A$ is any $m \times n$ matrix $G$ satisfying
$AGA = A$.

(c) A generalized inverse of $A$ is any $m \times n$ matrix $G$ such that $AG$
is a projection onto the range of $A$.

It is not hard to show that these definitions are equivalent. Definition
(a) justifies the concept in terms of solving consistent sets of linear
equations with matrix $A$. Definition (b) is convenient for matrix manipulation,
while (c) gives some geometric insight. Note that in (c) "projection" does
not mean the unique orthogonal projection onto the range of $A$, but any
idempotent matrix with this range. When $A$ is not a nonsingular square
matrix, it possesses many generalized inverses. Any generalized inverse of
$A$ will be denoted by $A^-$. Generalized inverses are widely used to provide a
unified notation for linear statistical problems when matrices may be
singular. The following theorem and its proof illustrate the convenience
of this notion.
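
Definition (b) is easy to verify by direct multiplication. The Python
sketch below takes a hypothetical singular 2 x 2 matrix and two different
candidate generalized inverses, checks AGA = A for each, and evaluates the
quadratic form x'Gx at a vector x in the range of A. The matrices and
helper names are illustrative assumptions.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def is_generalized_inverse(A, G, tol=1e-12):
    """Check definition (b): AGA = A, up to rounding error."""
    AGA = matmul(matmul(A, G), A)
    return all(abs(AGA[i][j] - A[i][j]) <= tol
               for i in range(len(A)) for j in range(len(A[0])))


def quadratic_form(x, G):
    """Evaluate x'Gx."""
    return sum(x[i] * G[i][j] * x[j]
               for i in range(len(x)) for j in range(len(x)))


A = [[1.0, 1.0], [1.0, 1.0]]          # singular, rank 1
G1 = [[1.0, 0.0], [0.0, 0.0]]         # one generalized inverse
G2 = [[0.25, 0.25], [0.25, 0.25]]     # another generalized inverse
x = [2.0, 2.0]                        # a vector in the range of A

print(is_generalized_inverse(A, G1), is_generalized_inverse(A, G2))
print(quadratic_form(x, G1), quadratic_form(x, G2))
```

Both candidates satisfy AGA = A, and the two quadratic forms agree at x
because x lies in the range of the symmetric matrix A.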


THEOREM 2. Suppose that $Y \sim N_p(0,\Sigma)$ and $\Sigma$ has rank $r$. Then

(a) With probability 1, $Y'\Sigma^-Y$ is the same for all choices of $\Sigma^-$.

(b) $Y'\Sigma^-Y \sim \chi^2(r)$ and is the unique quadratic form having this
distribution.

PROOF. (a) If $x$ is any vector in the range of $\Sigma$, so that $x = \Sigma y$ for
some $y$, then

$x'\Sigma^-x = y'\Sigma\Sigma^-\Sigma y = y'\Sigma y$

by definition (b) and symmetry of $\Sigma$. So $x'\Sigma^-x$ is the same number for all
choices of $\Sigma^-$ whenever $x$ is in the range of $\Sigma$. But $Y = \Sigma^{1/2}Z$ is in the range
of $\Sigma^{1/2}$ (which is the same as that of $\Sigma$) with probability 1.


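(b) Since $Y'\Sigma^-Y$ is the same for all choices of $\Sigma^-$, we can choose a
convenient generalized inverse. If $\Sigma$ has rank $r$ and positive characteristic
roots $d_1,\dots,d_r$, there is an orthogonal $P$ such that
$P\Sigma P' = \mathrm{diag}(d_1,\dots,d_r,0,\dots,0)$. Then
$\Sigma^- = P'\,\mathrm{diag}(d_1^{-1},\dots,d_r^{-1},0,\dots,0)\,P$ is a generalized inverse of $\Sigma$, and

$Y'\Sigma^-Y = Z'\Sigma^{1/2}\Sigma^-\Sigma^{1/2}Z = \sum_{i=1}^{r} (Z_i^*)^2$

where $Z^* = PZ \sim N_p(0,I_p)$. Thus $Y'\Sigma^-Y \sim \chi^2(r)$.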

It remains to show that $Y'CY \sim \chi^2(r)$ implies that $C$ is a generalized
inverse of $\Sigma$. By Theorem 1, $Y'CY \sim \chi^2(r)$ if and only if $\Sigma^{1/2}C\Sigma^{1/2}$ is idempotent
of rank $r$. But then $\Sigma^{1/2}C\Sigma^{1/2}$ is a projection, and since its range is contained
in the range of $\Sigma^{1/2}$ and has the same dimension $r$, it is a projection onto the
range of $\Sigma^{1/2}$. A projection acts as the identity transformation on its range,
so

$\Sigma C\Sigma = \Sigma^{1/2}(\Sigma^{1/2}C\Sigma^{1/2})\Sigma^{1/2} = \Sigma^{1/2}\Sigma^{1/2} = \Sigma$

and $C$ satisfies definition (b) of $\Sigma^-$.
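
Theorem 2 can be connected back to Pearson's statistic with a short
numerical check. The Python sketch below assumes the standard multinomial
covariance matrix of the full vector of M cell frequencies, with entries
n(delta_ij p_i - p_i p_j), which is singular, and verifies that the
diagonal matrix with entries 1/(n p_i) satisfies definition (b) of a
generalized inverse. The quadratic form in the centered frequencies is
then Pearson's sum. The sample size, cell probabilities, counts and helper
names are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def multinomial_cov(n, p):
    """Covariance matrix of the M cell frequencies: n (diag(p) - p p')."""
    M = len(p)
    return [[n * ((p[i] if i == j else 0.0) - p[i] * p[j]) for j in range(M)]
            for i in range(M)]


def diagonal_candidate(n, p):
    """Diagonal matrix with entries 1 / (n p_i)."""
    M = len(p)
    return [[(1.0 / (n * p[i]) if i == j else 0.0) for j in range(M)]
            for i in range(M)]


n, p = 50, [0.2, 0.3, 0.5]   # hypothetical sample size and cell probabilities
S = multinomial_cov(n, p)
G = diagonal_candidate(n, p)
SGS = matmul(matmul(S, G), S)
max_error = max(abs(SGS[i][j] - S[i][j]) for i in range(3) for j in range(3))

counts = [7, 18, 25]         # hypothetical cell frequencies summing to n
y = [N - n * pi for N, pi in zip(counts, p)]
quad = sum(y[i] * G[i][i] * y[i] for i in range(3))  # y'Gy for diagonal G
print(max_error, quad)
```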

When $Y \sim N_p(\mu,\Sigma)$, the representation $Y = \Sigma^{1/2}Z + \mu$ can be applied to
the study of quadratic forms in $Y$. Repeating the argument of Theorem 1
shows that $Y'CY$ has the distribution of a random variable of the form

(5)    $\sum_{i=1}^{p} \lambda_i Z_i^2 + 2\sum_{i=1}^{p} b_i Z_i + \mu'C\mu$

where $Z \sim N_p(0,I_p)$, the $b_i$ are constants, and the $\lambda_i$ are as in Theorem 1.
These distributions have
no neat classification. We will make only one foray into this "noncentral
case," to look again at Pearson's recipe.

When $Z \sim N_p(0,I_p)$, $Z'Z \sim \chi^2(p)$ by definition. When $Y \sim N_p(\mu,I_p)$, the
distribution of $Y'Y$, or equivalently of $(Z+\mu)'(Z+\mu)$, is defined to be the
noncentral chi-square distribution with $p$ degrees of freedom and noncentrality
parameter $\delta = \mu'\mu$. (Since the statistic is the square of the distance of $Z$
from the point $-\mu$ in $R^p$, it follows from the circular symmetry of the
density function of $Z$ that this distribution depends only on the distance of
$-\mu$ from the origin. Thus parameterizing the distribution by $(p,\delta)$ is
justified.) We will use the notation $Y'Y \sim \chi^2(p,\delta)$.

Suppose now that $Y \sim N_p(\mu,\Sigma)$ and $\Sigma$ has rank $r$. What then is the
distribution of the generalized Pearson statistic $Y'\Sigma^-Y$? Alas, since
$Y = \Sigma^{1/2}Z + \mu$, $Y$ is not in the range of $\Sigma$ unless $\mu$ is, so that the quadratic
form $Y'\Sigma^-Y$ changes with the choice of $\Sigma^-$. If $\mu$ is in the range of $\Sigma$, we
can write $\mu = \Sigma^{1/2}\nu$ and follow the argument of Theorem 2 to show that $Y'\Sigma^-Y$
is well defined and that

$Y'\Sigma^-Y = (Z+\nu)'\Sigma^{1/2}\Sigma^-\Sigma^{1/2}(Z+\nu)$
$\qquad = \sum_{i=1}^{r} (Z_i^* + \nu_i^*)^2 \sim \chi^2(r,\delta)$

where $Z^* = PZ$, $\nu^* = P\nu$ and $\delta = \sum_{i=1}^{r} (\nu_i^*)^2$. But by the same argument,

$\sum_{i=1}^{r} (\nu_i^*)^2 = \nu'\Sigma^{1/2}\Sigma^-\Sigma^{1/2}\nu = \mu'\Sigma^-\mu$.

Thus $Y'\Sigma^-Y \sim \chi^2(r,\mu'\Sigma^-\mu)$. When $\mu$ is not in the range of $\Sigma$, both the form
and the distribution of $Y'\Sigma^-Y$ vary with the choice of $\Sigma^-$. Of course, when
$\Sigma$ is nonsingular these complications do not arise, and
$Y'\Sigma^{-1}Y \sim \chi^2(p,\mu'\Sigma^{-1}\mu)$ for any mean vector $\mu$.
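
For the nonsingular case the noncentrality parameter is a one-line
computation. The Python sketch below evaluates it for a hypothetical
2 x 2 covariance matrix and mean vector using the explicit 2 x 2 inverse,
and also reports p plus the noncentrality parameter, which is the mean of
the noncentral chi-square distribution as defined above (the expected
value of (Z+mu)'(Z+mu) is p + mu'mu). The function name and the inputs
are assumptions made for illustration.

```python
def noncentrality_2x2(mu, sigma):
    """Return mu' Sigma^{-1} mu for a nonsingular 2x2 matrix Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("sigma must be nonsingular")
    # Solve Sigma x = mu with the explicit 2x2 inverse.
    x0 = (d * mu[0] - b * mu[1]) / det
    x1 = (-c * mu[0] + a * mu[1]) / det
    return mu[0] * x0 + mu[1] * x1


sigma = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical nonsingular covariance matrix
mu = [1.0, 1.0]                   # hypothetical mean vector

delta = noncentrality_2x2(mu, sigma)
mean_of_statistic = len(mu) + delta  # p + delta
print(delta, mean_of_statistic)
```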


We have concentrated on cases in which the distribution of $Y'CY$ given
by Theorem 1 (or more generally by (5)) reduces to a chi-square distribution.
There are sound practical reasons for doing so, even though machine
computation makes it feasible to produce tables of critical points for the
distributions of $\sum \lambda_i Z_i^2$. Tests of fit based on quadratic forms in
(approximately) multivariate normal random variables are the natural
generalization of Pearson's chi-square test. These tests must compete for
the attention of practical statisticians against special-purpose tests for
fit to specific common families, and against general tests of fit based on
the empirical distribution function (EDF tests). These competitors are
usually much more powerful than chi-square tests, but are also much less
flexible in adapting to unknown parameters and discrete or multivariate
data. In particular, they require separate computation of critical points
for each hypothesized family. (I will not mention - this is a rhetorical
device I learned from Cicero - that the EDF tests break down almost
completely when faced with hypothesized distributions which are
multidimensional or are not location-scale families.) If a test of chi-square
type also requires a special computation of critical points to be applicable
to a given problem, we would usually be wiser to allot our computer time to
an EDF test instead. Thus generalizations of Pearson's statistic lose much
of their attractiveness if their critical points cannot be found in standard
tables. In the light of Theorem 1, the relevant tables will be those for
the chi-square distributions.

3.  LARGE  SAMPLE  THEORY 

Since the earliest days of statistics it has been noticed that
complicated distributions often have simple approximations for large
samples. The distribution of Pearson's chi-square statistic is an
example. The use of chi-square tests, both as tests of fit and for other
common applications, is based on approximating multinomial distributions
by the multivariate normal distributions which are their limits as the
sample size increases. We will therefore review some facts about statistical
large sample theory. There are three major aspects to this theory. The
first simply asks questions of convergence: "What happens in the limit?"
The second studies the approach to the limit by providing rates of convergence,
asymptotic expansions, etc. The third considers the usefulness of the
asymptotic forms provided by the first two parts of the subject as
approximations to the fixed sample size truth. Explicit numerical
calculation and simulation play large roles here. Only the first aspect
of large sample theory will concern us, both for simplicity's sake and
because (to make an appalling generalization) in the field of chi-square
tests the second aspect has had little practical impact and the third has
shown that use of limiting distributions is an adequate approximation for
quite moderate sample sizes.
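
The kind of simulation meant by the third aspect can be sketched in a few
lines of Python. The example below is purely illustrative: the cell
probabilities, sample size, number of replications and seed are
hypothetical, and the case M = 3 is chosen because the chi-square
distribution with 2 degrees of freedom has the closed-form upper tail
exp(-x/2), so its 5 percent point is -2 log(0.05) and no tables are
needed. The sketch estimates how often Pearson's statistic for a simple
null hypothesis exceeds that point.

```python
import math
import random


def pearson_statistic(counts, probs):
    """Sum over cells of (N_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, probs))


def simulate_counts(n, probs, rng):
    """Cell frequencies for n independent observations with the given cell probabilities."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts


probs = [0.2, 0.3, 0.5]             # hypothetical cell probabilities, M = 3
n, replications = 50, 4000          # hypothetical sample size and repetitions
critical = -2.0 * math.log(0.05)    # upper 5 percent point of chi-square(2)
rng = random.Random(1)              # fixed seed for reproducibility

stats = [pearson_statistic(simulate_counts(n, probs, rng), probs)
         for _ in range(replications)]
exceed_fraction = sum(s > critical for s in stats) / replications
mean_statistic = sum(stats) / replications
print(critical, exceed_fraction, mean_statistic)
```

If the limiting distribution is a good approximation at this sample size,
the exceedance fraction should be close to 0.05 and the average statistic
close to M - 1 = 2.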

The most useful mode of convergence for statistical use is convergence
in distribution. If $X_1, X_2, \ldots$ are $R^p$-valued random variables, $X_n$ having
df $F_n$, we say that the sequence converges in distribution to the distribution
having df $F$ if $F_n(x) \to F(x)$ for every continuity point $x$ of $F$. Abusing
notation to also denote by $F$, $F_n$ the probability measures on $R^p$ generated
by these df's, convergence in distribution is equivalent to

    $\lim_{n\to\infty} P[X_n \text{ in } A] = \lim_{n\to\infty} F_n(A) = F(A)$

for all Borel sets $A$ in $R^p$ whose boundaries have probability zero under $F$.
Thus $P[X_n \text{ in } A]$ can be approximated by $F(A)$ for large $n$. Convergence in
distribution to the distribution placing probability 1 on a single point $c$
is equivalent to convergence in probability of $X_n$ to $c$. That is, for any
$\epsilon > 0$, $P[\,|X_n - c| > \epsilon\,] \to 0$ as $n \to \infty$. (We write this $X_n \to c\,(P)$.) All of this
is of course a province of the measure theory which underlies statistical
theory and sometimes invades the conscious thought of the working statistician.
A nice exposition in effortless generality appears in the first chapter of
[1].

We require only one specific and two general facts about convergence in
distribution. The specific fact is the multivariate central limit theorem:
If $X_1, X_2, \ldots$ are independent $R^p$-valued random variables having a common
distribution with vector of means $\mu$ and finite $p \times p$ covariance matrix $\Sigma$, and
if $\bar{X}_n = n^{-1}\sum_{j=1}^{n} X_j$, then $n^{1/2}(\bar{X}_n - \mu)$ converges in distribution to $N_p(0,\Sigma)$.
This is written

    $n^{1/2}(\bar{X}_n - \mu) \xrightarrow{\mathcal{D}} N_p(0,\Sigma)$.

We often abuse notation and write instead

    $n^{1/2}(\bar{X}_n - \mu) \xrightarrow{\mathcal{D}} Y$

where $Y \sim N_p(0,\Sigma)$, even though convergence in distribution makes no statement
about convergence of values of $n^{1/2}(\bar{X}_n - \mu)$ or about any limiting random
variable.

The essential general fact is the continuity theorem: If $Y_n \xrightarrow{\mathcal{D}} Y$ and
$h: R^p \to R^k$ is continuous with probability 1 with respect to the distribution
of $Y$, then $h(Y_n) \xrightarrow{\mathcal{D}} h(Y)$. The continuity theorem licenses our natural desire
to conclude that when (say) $Y_n$ is approximately $N_p(0,\Sigma)$, then $Y_n'CY_n$ has
approximately the distribution specified by Theorem 1. The central limit
theorem provides us with a large supply of random variables which are
approximately multivariate normal. The two together suffice to make rigorous
Pearson's proof outlined in Section 1 above.

The second general fact is needed to still the clamoring voices of the
pedants. If $X_n \xrightarrow{\mathcal{D}} X$ and $Y_n \to c\,(P)$, then $(X_n, Y_n) \xrightarrow{\mathcal{D}} (X, c)$. That is, convergence
of both marginal distributions of $(X_n, Y_n)$ suffices for convergence of the
joint distribution if one sequence of marginal distributions has a degenerate
limit. Convergence of marginal distributions in general gives no information
about the joint distribution. The natural manipulations we wish to make
are all licensed by these two general facts. For example, if $X_n \xrightarrow{\mathcal{D}} X$ and
$R_n \to 0\,(P)$, then $X_n + R_n \xrightarrow{\mathcal{D}} X$, as reason and justice demand. For
$(X_n, R_n) \xrightarrow{\mathcal{D}} (X, 0)$ by the second fact, and the continuity theorem now applies
with $h(x,y) = x+y$.

The  following  section  will  provide  examples  in  plenty  of  the  way  in 
which  the  three  facts  mentioned  here  combine  with  the  law  of  large  numbers 
and  Taylor's  theorem  to  form  the  elementary  arithmetic  of  statistical  large 
sample  theory. 

4.  CHI-SQUARE  TESTS  OF  FIT 

Returning at last to the problem and notation of Section 1, we wish to
test whether independent random variables $X_1,\ldots,X_n$ taking values in
Euclidean p-space $R^p$ have df $G(\cdot|\theta)$ for some $\theta$ in $\Omega$, an open set in $R^m$.
Partitioning $R^p$ into $M$ cells $E_1,\ldots,E_M$, we denote by $N_i$ the number of
$X_1,\ldots,X_n$ falling in $E_i$ and by $p_i(\theta)$ the probability that a random variable
with df $G(\cdot|\theta)$ falls in $E_i$. The vector of standardized cell frequencies
is the M-vector $V_n(\theta)$ with ith component

    $\frac{N_i - n p_i(\theta)}{[n p_i(\theta)]^{1/2}}$.

Finally, $\theta$ is estimated from $X_1,\ldots,X_n$ by $\theta_n = \theta_n(X_1,\ldots,X_n)$, and
$C_n = C_n(X_1,\ldots,X_n)$ is a possibly data-dependent non-negative definite
symmetric $M \times M$ matrix. Statistics of chi-square type are statistics of
the form

    (6)  $V_n(\theta_n)' C_n V_n(\theta_n)$,

that is, non-negative definite quadratic forms in the standardized cell
frequencies.

If the vector $V_n(\theta_n)$ has a limiting $N_M(0,\Sigma(\theta_0))$ distribution when
$G(\cdot|\theta_0)$ is the true df, and if $C_n \to C(\theta_0)\,(P)$, then the continuity theorem
tells us that the limiting distributions of statistics of chi-square type
under the null hypothesis are completely described by Theorem 1. Establishing
asymptotic normality of $V_n(\theta_n)$ is therefore the primary mathematical hurdle
in the theory of chi-square statistics. When $\theta$, or more precisely the
vector $p(\theta) = (p_1(\theta),\ldots,p_M(\theta))'$, is known, this hurdle is low indeed. For
the $N_i$ have a multinomial distribution with parameters $n$ and $p(\theta)$. The
vector $(N_1,\ldots,N_M)$ can be expressed as the sum of $n$ independent M-dimensional
indicator variables $\delta_1,\ldots,\delta_n$, where $\delta_j$ has ith component 1 and all others 0
when $X_j$ falls in cell $E_i$. It follows from a computation of covariances and
the multivariate central limit theorem that under $G(\cdot|\theta)$

    (7)  $V_n(\theta) \xrightarrow{\mathcal{D}} N_M(0, I_M - q(\theta)q(\theta)')$,

where $I_M$ is the $M \times M$ identity matrix and

    $q(\theta) = (p_1(\theta)^{1/2},\ldots,p_M(\theta)^{1/2})'$.

This is just the multivariate normal approximation to a multinomial distribution,
expressed in a notation which will prove convenient for easy extension to the
more common case when $\theta$ must be estimated.
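The "computation of covariances" behind (7) is short enough to write out. The sketch below uses only the standard multinomial covariance formula; $\delta_{ij}$ denotes the Kronecker delta (not the indicator vectors $\delta_j$ of the text) and $V_{n,i}$ denotes the ith component of $V_n(\theta)$, both notations introduced here for illustration.

```latex
\operatorname{Cov}(N_i, N_j) = n\,\bigl[\,p_i\,\delta_{ij} - p_i p_j\,\bigr],
\qquad
\operatorname{Cov}\bigl(V_{n,i}, V_{n,j}\bigr)
  = \frac{n\,\bigl[\,p_i\,\delta_{ij} - p_i p_j\,\bigr]}{n\,(p_i p_j)^{1/2}}
  = \delta_{ij} - p_i^{1/2} p_j^{1/2}.
```

In matrix form the covariance matrix of $V_n(\theta)$ is $I_M - q(\theta)q(\theta)'$ for every $n$, so only the normality in (7) is a large sample statement.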

In that latter case, the asymptotic behavior of $V_n(\theta_n)$ will depend on
that of $\theta_n$, as Fisher recognized. Thus the large sample theory of chi-square
statistics draws on the large sample theory of estimators, a main current of
statistical theory since Fisher's time. Because of the importance of this
subject, and to illustrate the application of the principles stated in
Section 3, there follows an account of the large sample behavior of the
minimum chi-square estimator used in the classical Pearson test.

For given $N_1,\ldots,N_M$ this estimator is any value of $\theta$ which minimizes
the Pearson statistic

    $P_n(\theta) = V_n(\theta)'V_n(\theta) = \sum_{i=1}^{M} \frac{[N_i - n p_i(\theta)]^2}{n p_i(\theta)}$.

It is intuitively clear, and not hard to prove, that for such purposes as
studying $V_n(\theta_n)$, the minimum chi-square estimator is asymptotically
equivalent to the minimum modified chi-square estimator $\bar\theta_n$ which minimizes
the modified chi-square statistic

    $Q_n(\theta) = \sum_{i=1}^{M} \frac{[N_i - n p_i(\theta)]^2}{N_i}$.

Working with $Q_n(\theta)$ is arithmetically simpler and conceptually identical to
working with $P_n(\theta)$. We will therefore study $\bar\theta_n$, which we assume to exist
and be a measurable function of $N_1,\ldots,N_M$. The first question concerns
consistency of this estimator - does $\bar\theta_n$ approach the true value of $\theta$ as $n$
increases?
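The two criteria differ only in their denominators, which a few lines of code make concrete. This is a minimal sketch with hypothetical function names and made-up cell counts; it evaluates both statistics at a fixed vector of cell probabilities and does not carry out the minimization over the parameter.

```python
def pearson_statistic(counts, probs):
    """Sum of (N_i - n p_i)^2 / (n p_i): expected counts in the denominator."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def modified_statistic(counts, probs):
    """Sum of (N_i - n p_i)^2 / N_i: observed counts in the denominator.
    Every observed count must be positive for this to be defined."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / c for c, p in zip(counts, probs))


# Illustrative data: 100 observations in 4 cells, equiprobable hypothesis.
counts = [10, 20, 30, 40]
probs = [0.25, 0.25, 0.25, 0.25]
pearson_value = pearson_statistic(counts, probs)
modified_value = modified_statistic(counts, probs)
print(pearson_value, modified_value)
```

The parameter enters the modified statistic only through the numerators, which is why minimizing it is arithmetically simpler.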

LEMMA 1. Suppose that $M \geq m$, that each $\partial p_i/\partial\theta_k$ is continuous at $\theta = \theta_0$,
and that the $M \times m$ matrix $\bigl(\partial p_i/\partial\theta_k\,(\theta_0)\bigr)$ has rank $m$. Then any minimum modified
chi-square estimator $\bar\theta_n$ satisfies $\bar\theta_n \to \theta_0\,(P)$ when $G(\cdot|\theta_0)$ is the true df
of the $X_j$.

PROOF. By the law of large numbers,

    (8)  $N_i/n \to p_i(\theta_0)\,(P)$,   $i = 1,\ldots,M$

and therefore by the continuity theorem

    $Q_n(\theta_0)/n = \sum_{i=1}^{M} \frac{[N_i/n - p_i(\theta_0)]^2}{N_i/n} \to 0\,(P)$.

But by the definition of $\bar\theta_n$,

    $0 \leq Q_n(\bar\theta_n)/n \leq Q_n(\theta_0)/n$

and so $Q_n(\bar\theta_n)/n \to 0\,(P)$. This can only happen if $N_i/n - p_i(\bar\theta_n) \to 0\,(P)$ for
$i = 1,\ldots,M$. This with (8) implies that $p(\bar\theta_n) \to p(\theta_0)\,(P)$, which in turn
implies that $\bar\theta_n \to \theta_0\,(P)$ if the function $\theta \to p(\theta)$ from $R^m$ to $R^M$ has a
continuous inverse at $\theta = \theta_0$, using the continuity theorem once again. The
hypothesis of the lemma is sufficient for the existence of a continuous
inverse. (The proof of this analytical fact is similar to that of the
familiar inverse function theorem for the $M = m$ case: The continuous
derivatives make the transformation locally linear, and the rank condition
suffices when the transformation is linear.)


The actual large sample form of $\bar\theta_n$ is given by the following theorem.
The result is Fisher's, but a rigorous proof first appeared in Cramér's
classic book [3] in 1946. The proof provides as a bonus an expression for
$V_n(\theta_n)$ for general estimators $\theta_n$. Denote by $B(\theta)$ the $M \times m$ matrix with (i,k)th
entry

    $p_i(\theta)^{-1/2}\,\frac{\partial p_i(\theta)}{\partial\theta_k}$.

In analogy with the common o(1) notation from analysis, $o_p(1)$ denotes any
quantity converging in probability to zero as $n$ increases. From this point,
we shall for brevity omit the argument $\theta$ when $\theta = \theta_0$. Thus, for example,
$B = B(\theta_0)$ and $V_n = V_n(\theta_0)$ in the statement of the following theorem.

THEOREM 3. Under the conditions of Lemma 1, when $G(\cdot|\theta_0)$ holds,

    (9)  $n^{1/2}(\bar\theta_n - \theta_0) = (B'B)^{-1}B'V_n + o_p(1)$.

Now $B'B$ is nonsingular by the rank $m$ assumption, and since the determinant
of a matrix is a continuous function of its elements,

    $\det(B'B + o_p(1)) \to \det(B'B) \neq 0\ (P)$.

Hence if $A_n$ is the event that $B'B + o_p(1)$ is nonsingular, and $\chi_n$ the
indicator function of this event, then the probability of $A_n$ under $G(\cdot|\theta_0)$
approaches 1, $\chi_n \to 1\,(P)$, and $[B'B + o_p(1)]^{-1}\chi_n \to (B'B)^{-1}\,(P)$. Applying
$[B'B + o_p(1)]^{-1}\chi_n$ to both sides of (13) gives

    $[B'B + o_p(1)]^{-1}\chi_n B'V_n - n^{1/2}(\bar\theta_n - \theta_0)\chi_n = o_p(1)$

which in turn implies the result of the theorem.


Theorem 3 yields immediate fruit, and (12) will produce a later harvest
as well. Reviewing the proof, it is easy to see that (12) holds for any
consistent estimator $\theta_n$ of $\theta$ in the form

    (14)  $V_n(\theta_n) = V_n - B\,n^{1/2}(\theta_n - \theta_0) + o_p(1)\,n^{1/2}(\theta_n - \theta_0) + o_p(1)$.

This is the central relation in the theory of chi-square tests, as it
expresses $V_n(\theta_n)$ in terms of the standardized multinomial vector $V_n$ and a
separate term reflecting the effect of estimating $\theta$. Notice that the third
term on the right is $o_p(1)$ whenever $n^{1/2}(\theta_n - \theta_0)$ converges in distribution.
We can now provide quick proofs of several important results.

The first of these is Fisher's solution to the question of the behavior
of Pearson's statistic when $\theta$ is estimated by $\bar\theta_n$. Substituting (9) into
(14), we see that

    $V_n(\bar\theta_n) = (I_M - B(B'B)^{-1}B')V_n + o_p(1)$.

Asymptotic normality for $V_n$ was immediate (see (7)). By the continuity
theorem and the result from Section 2 on linear transformations of
multivariate normal variables, it follows that under $G(\cdot|\theta_0)$,

    $V_n(\bar\theta_n) \xrightarrow{\mathcal{D}} N_M(0,\Sigma)$

    $\Sigma = (I_M - B(B'B)^{-1}B')(I_M - qq')(I_M - B(B'B)^{-1}B')$
    $\phantom{\Sigma} = I_M - qq' - B(B'B)^{-1}B'$.

The last equality is a consequence of the important relation $q'B = 0$, which
holds because $\sum_{i=1}^{M} p_i = 1$ implies that $\sum_{i=1}^{M} \partial p_i/\partial\theta_k = 0$ for each $k$. The
limiting null distribution of any statistic $V_n(\bar\theta_n)'C_nV_n(\bar\theta_n)$ is now given
by Theorem 1. In particular, the Pearson statistic $P_n(\bar\theta_n) = V_n(\bar\theta_n)'V_n(\bar\theta_n)$
has the limiting distribution of $\sum_{i=1}^{M} \lambda_i Z_i^2$, where $\lambda_i$ are the characteristic roots of
$\Sigma$. A bit of matrix multiplication will show that $qq'$ and $B(B'B)^{-1}B'$ are
symmetric idempotent matrices, that is, orthogonal projections. Moreover,
we just saw that they are orthogonal to each other. Because $qq'$ has rank
1 and $B(B'B)^{-1}B'$ has rank $m$ by assumption, $\Sigma$ is an orthogonal projection of
rank $M-m-1$. So its characteristic roots are $M-m-1$ 1's and $m+1$ 0's, and the
limiting null distribution of $P_n(\bar\theta_n)$ is $\chi^2(M-m-1)$. Notice especially that
this is true for any $\theta_0$ in $\Omega$, even though $\Sigma$ varies with $\theta_0$. This is the
famous "subtract one degree of freedom for each parameter estimated" result.
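The projection argument can be checked numerically on a small case. The sketch below assumes a hypothetical one-parameter family with three cells, $p(\theta) = (\theta^2,\ 2\theta(1-\theta),\ (1-\theta)^2)$, which is not taken from the text; with $M = 3$ and $m = 1$ the matrix $I_M - qq' - B(B'B)^{-1}B'$ should be idempotent with trace $M-m-1 = 1$, and $q'B$ should vanish. The function names are likewise illustrative.

```python
import math


def example_family(theta):
    """Cell probabilities and their derivatives for an assumed
    one-parameter, three-cell family (M = 3, m = 1)."""
    probs = [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]
    derivs = [2 * theta, 2 - 4 * theta, -2 * (1 - theta)]
    return probs, derivs


def limiting_covariance(theta):
    """Return q, the single column b of B, and I - qq' - B(B'B)^{-1}B'."""
    probs, derivs = example_family(theta)
    q = [math.sqrt(p) for p in probs]
    b = [d / math.sqrt(p) for d, p in zip(derivs, probs)]
    btb = sum(x * x for x in b)  # B'B is a scalar because m = 1
    size = len(probs)
    sigma = [[(1.0 if i == j else 0.0) - q[i] * q[j] - b[i] * b[j] / btb
              for j in range(size)] for i in range(size)]
    return q, b, sigma


def matmul(a, c):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * c[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


q, b, sigma = limiting_covariance(0.3)
q_dot_b = sum(x * y for x, y in zip(q, b))          # should be 0
trace = sum(sigma[i][i] for i in range(3))          # should be M - m - 1 = 1
sigma_squared = matmul(sigma, sigma)
idempotency_gap = max(abs(sigma_squared[i][j] - sigma[i][j])
                      for i in range(3) for j in range(3))
print(q_dot_b, trace, idempotency_gap)
```

The trace of an orthogonal projection equals its rank, so a trace of 1 corresponds to the single degree of freedom left after estimating one parameter.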

Now $\bar\theta_n$ is often not the most convenient available estimator of $\theta$. In
testing fit to the univariate normal family with $\theta = (\mu,\sigma)$, for example, the
cell probability for a cell $E_i = (a_{i-1}, a_i]$ is

    $p_i(\mu,\sigma) = \Phi\bigl(\frac{a_i - \mu}{\sigma}\bigr) - \Phi\bigl(\frac{a_{i-1} - \mu}{\sigma}\bigr)$,

where $\Phi$ is the standard normal df. The equations (10) have no closed-form
solution, nor do the yet more complicated equations $\partial P_n(\theta)/\partial\theta_k = 0$ defining
the minimum chi-square estimator. A visit to your local computing center
will uncover library programs for evaluating $\Phi$ and iteratively solving the
equations (10). Nonetheless, it is hard to ignore the universally used
sample mean and variance, $\hat\theta_n = (\bar{X}, s)$. What will befall us if we use these
estimators in the Pearson statistic instead of $\bar\theta_n$? To answer this question,
we must discover the large-sample behavior of $\hat\theta_n$ and then consult (14).
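For readers who want to see the quantities involved, here is a minimal sketch of the normal cell probabilities and of the Pearson statistic evaluated at the sample mean and standard deviation. The function names, the choice of unbounded end cells, and the data are assumptions made for illustration; $\Phi$ is computed from the error function in the Python standard library. The sketch only computes the statistic and says nothing about which reference distribution to compare it with.

```python
import math


def standard_normal_cdf(z):
    """Standard normal df via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cell_probabilities(boundaries, mu, sigma):
    """Probabilities of the cells (-inf, a_1], (a_1, a_2], ..., (a_{M-1}, inf)
    for increasing interior boundaries a_1 < ... < a_{M-1}."""
    cdf_values = ([0.0]
                  + [standard_normal_cdf((a - mu) / sigma) for a in boundaries]
                  + [1.0])
    return [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]


def pearson_with_sample_moments(data, boundaries):
    """Pearson statistic with (mu, sigma) replaced by the sample mean and
    the standard deviation computed with divisor n."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    probs = normal_cell_probabilities(boundaries, mean, s)
    counts = [0] * (len(boundaries) + 1)
    for x in data:
        # Index of the cell (a_{i-1}, a_i] containing x.
        cell = sum(1 for a in boundaries if x > a)
        counts[cell] += 1
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


probs = normal_cell_probabilities([-1.0, 0.0, 1.0], mu=0.0, sigma=1.0)
statistic = pearson_with_sample_moments(
    [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, -0.7, 0.6], [-0.5, 0.5])
print([round(p, 4) for p in probs], statistic)
```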

This is best done in greater generality. The sample mean and standard
deviation (taking $s^2 = \sum_{j=1}^{n}(X_j - \bar{X})^2/n$) form the maximum likelihood estimator
(MLE) of $\theta = (\mu,\sigma)$ in the univariate normal family. In general, the MLE
$\hat\theta_n = \hat\theta_n(X_1,\ldots,X_n)$ of $\theta$ is defined as any value of $\theta$ maximizing the joint
density function of the observations considered as a function of $\theta$ for given
$X_1,\ldots,X_n$. This recipe for a general method of estimating parameters is
another of Fisher's contributions. It is intuitively forceful, estimating
$\theta$ to be the value making the actually observed $X_1,\ldots,X_n$ "most probable."
More satisfying to the perverse theoretician, the MLE is guaranteed to have
good properties in large samples. Specifically, suppose that the $X_j$ are
independent with common density function $f(\cdot|\theta_0)$. Then under reasonable
smoothness conditions,

    (15)  $n^{1/2}(\hat\theta_n - \theta_0) = J(\theta_0)^{-1}\, n^{-1/2} \sum_{j=1}^{n} \frac{\partial \log f(X_j|\theta_0)}{\partial\theta} + o_p(1)$.

Here $\partial \log f/\partial\theta$ is the m-vector of partial derivatives with respect to
$\theta_1,\ldots,\theta_m$ and $J(\theta)$ is the $m \times m$ matrix with $(k,\ell)$th component

    $E_\theta\left[\frac{\partial \log f(X|\theta)}{\partial\theta_k}\,\frac{\partial \log f(X|\theta)}{\partial\theta_\ell}\right]$.

It follows from (15) by the multivariate central limit theorem that

    (16)  $n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{D}} N_m(0, J(\theta_0)^{-1})$.

The matrix $J(\theta)$ is called the information matrix for the family $f(x|\theta)$.
The inverse $J(\theta_0)^{-1}$ is the "smallest possible" covariance matrix for the
limiting distribution of an estimator of $\theta$, in several specific senses
which this is not the place to specify. Thus (16) says roughly that the MLE
has the tightest possible concentration about the true $\theta_0$ in large samples.
This is called asymptotic efficiency of the MLE.
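As a worked instance of the information matrix, consider the univariate normal family with $\theta = (\mu,\sigma)$ used as the running example. The calculation below is standard and is offered only as an illustration of the definition.

```latex
\log f(x\,|\,\mu,\sigma) = -\log\sigma - \frac{(x-\mu)^2}{2\sigma^2} + \text{const},
\qquad
\frac{\partial \log f}{\partial\mu} = \frac{x-\mu}{\sigma^2},
\qquad
\frac{\partial \log f}{\partial\sigma} = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3},

J(\mu,\sigma) = \frac{1}{\sigma^2}
\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix},
\qquad
J(\mu,\sigma)^{-1} = \sigma^2
\begin{pmatrix} 1 & 0 \\ 0 & 1/2 \end{pmatrix}.
```

Read through the limiting covariance $J(\theta_0)^{-1}$ of the MLE, this says that in large samples $\bar{X}$ has variance approximately $\sigma^2/n$ and $s$ has variance approximately $\sigma^2/(2n)$, with the two asymptotically uncorrelated.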

In the light of this pleasing result, it would be very intelligent, if
we wish to estimate $\theta$ from cell frequencies, to apply the MLE recipe to the
indicator variables $\delta_1,\ldots,\delta_n$ indicating into which cells $X_1,\ldots,X_n$ fall.
A bit of work shows that the information matrix in this case is $B'B$, and
that (15) reduces to (9). The minimum chi-square and minimum modified
chi-square and maximum likelihood estimators are all asymptotically
equivalent ways of estimating $\theta$ from the cell frequencies. That's
aesthetically satisfying.

Having summed up half a century of hard work on MLE's in one paragraph,
we can now substitute (15) into (14). Here $\hat\theta_n$ is the MLE of $\theta$ from the
ungrouped observations $X_1,\ldots,X_n$, not the less efficient MLE based on the
cell frequencies. Fortune is with us. The first term in (14), namely $V_n$,
was expressed at the beginning of this section as a sum of $n$ terms, one for
each $X_j$. The second term, namely (15), has the same form. And the rest of
(14) is $o_p(1)$. So we obtain from the first two terms a sum which is
asymptotically normal by the multivariate central limit theorem. A
computation of covariances gives specifically that

    $V_n(\hat\theta_n) \xrightarrow{\mathcal{D}} N_M(0,\Sigma)$

    $\Sigma = I_M - qq' - BJ^{-1}B'$.

Therefore the limiting null distribution of $P_n(\hat\theta_n)$ is that of $\sum_{i=1}^{M} \lambda_i Z_i^2$,
where $\lambda_i$ are the characteristic roots of $\Sigma$.
x 

Now $BJ^{-1}B'$ has the same rank $m$ as does $B$, and therefore has as its
range the range of $B$. Since $qq'$ is an orthogonal projection of rank 1
orthogonal to $B$, the characteristic roots of $\Sigma$ include $M-m-1$ 1's ($\Sigma$ acts
as the identity in directions orthogonal to the direct sum of the ranges of
$B$ and $qq'$) and one 0 ($\Sigma$ acts as zero on the range of $qq'$). The remaining roots
$\lambda_1,\ldots,\lambda_m$ reflect the fact that $\Sigma$ acts as $I_M - BJ^{-1}B'$ on the range of $B$. One version
of the "efficiency" of the MLE is that $\hat\theta_n$ is asymptotically preferable to $\bar\theta_n$
in the sense that $J - B'B$ is nonnegative definite. From this it can be shown
by matrix mangling that $0 \leq \lambda_i < 1$, and $0 < \lambda_i < 1$ except in the unusual
case when $J - B'B$ fails to be positive definite. The $\lambda_i$ of course depend on
$\theta_0$, as well as on the specific hypothesized family $f(\cdot|\theta)$.

We have now reached the second major consequence of (14). The statistic
$P_n(\hat\theta_n)$ has as its limiting null distribution the distribution of

    (17)  $\chi^2(M-m-1) + \sum_{i=1}^{m} \lambda_i Z_i^2$.

This is not a chi-square distribution. What is worse, the distribution varies
with $\theta_0$, so there is no single limiting distribution across the composite
null hypothesis. Since $0 \leq \lambda_i < 1$ for all $\theta_0$, it is at least true that
critical points of (17) lie between those of $\chi^2(M-m-1)$ and $\chi^2(M-1)$. When
there are many cells and few parameters, these bounds are close together.
But we cannot without care follow such natural paths as the use of $\bar{X}$ and $s$
in the Pearson statistic to test for normality.
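The bracketing of the critical points of (17) can be illustrated by simulation. In the sketch below the values $\lambda_1 = \lambda_2 = 0.5$, the choice $M = 6$, $m = 2$, the number of draws and the seed are all hypothetical; in a real problem the $\lambda_i$ depend on the unknown $\theta_0$ and on the hypothesized family. The tabled upper 5% points of $\chi^2(3)$ and $\chi^2(5)$ serve as the two bounds.

```python
import random

# Tabled upper 5% points of chi-square with M-m-1 = 3 and M-1 = 5
# degrees of freedom (M = 6 cells, m = 2 parameters assumed).
LOWER_BOUND = 7.815
UPPER_BOUND = 11.070


def simulated_upper_5_percent_point(free_df, lambdas, draws=20000, seed=2):
    """Simulate chi-square(free_df) + sum(lambda_i * Z_i^2) and return the
    empirical 95th percentile of the draws."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(free_df))
        total += sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        values.append(total)
    values.sort()
    return values[int(0.95 * draws)]


critical_point = simulated_upper_5_percent_point(3, [0.5, 0.5])
print(LOWER_BOUND, critical_point, UPPER_BOUND)
```

Using the $\chi^2(M-m-1)$ point with this statistic would reject too often under the null hypothesis, and using the $\chi^2(M-1)$ point would reject too rarely.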

After Chernoff and Lehmann [2] obtained the result (17) in 1954,
statistical theory produced a variety of ways of escape. One is suggested
immediately by Theorem 2: Compute a generalized inverse of $\Sigma$ and use the
corresponding quadratic form. It is easy to see from our previous study
of $\Sigma$ that $\Sigma$ has rank $M-1$ and

    $\Sigma^- = (I_M - BJ^{-1}B')^{-1}$

whenever $J - B'B$ is positive definite. If now

    $C_n = \bigl(I_M - B(\hat\theta_n)J(\hat\theta_n)^{-1}B(\hat\theta_n)'\bigr)^{-1}$

then $C_n \to \Sigma^-\,(P)$ and

    $V_n(\hat\theta_n)'C_nV_n(\hat\theta_n) \xrightarrow{\mathcal{D}} \chi^2(M-1)$

under $G(\cdot|\theta)$ for any $\theta$ in $\Omega$. This statistic is not as hard to compute as
may appear, as will be shown by example in Section 7. This statistic was
first studied by Rao and Robson [5], but without the supporting theory.
Rao and Robson present $C_n$ in the form

    $C_n = I_M + B(\hat\theta_n)\bigl[J(\hat\theta_n) - B(\hat\theta_n)'B(\hat\theta_n)\bigr]^{-1}B(\hat\theta_n)'$

which makes it clear that the new statistic $V_n(\hat\theta_n)'C_nV_n(\hat\theta_n)$ is the Pearson
statistic plus a second quadratic form. Challenge: prove that the two
expressions given for $C_n$ are equivalent.
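Without spoiling the challenge, the equivalence can at least be checked numerically. The sketch below treats the case $m = 1$, where $B$ is a single column $b$ and $J$ is a positive number exceeding $b'b$; the vector and the value of $J$ are made up for illustration and the function name is hypothetical. It multiplies $I_M - bJ^{-1}b'$ by $I_M + b(J - b'b)^{-1}b'$ and reports how far the product is from the identity.

```python
def identity_gap(b, info):
    """For m = 1: largest absolute deviation from the identity of the
    product (I - b b'/J)(I + b b'/(J - b'b))."""
    size = len(b)
    btb = sum(x * x for x in b)
    if info <= btb:
        raise ValueError("J - B'B must be positive")
    first = [[(1.0 if i == j else 0.0) - b[i] * b[j] / info
              for j in range(size)] for i in range(size)]
    second = [[(1.0 if i == j else 0.0) + b[i] * b[j] / (info - btb)
               for j in range(size)] for i in range(size)]
    product = [[sum(first[i][k] * second[k][j] for k in range(size))
                for j in range(size)] for i in range(size)]
    return max(abs(product[i][j] - (1.0 if i == j else 0.0))
               for i in range(size) for j in range(size))


# Hypothetical column of B (M = 4, m = 1) and information J > B'B.
gap = identity_gap([0.4, -0.1, -0.5, 0.2], info=0.9)
print(gap)
```

A gap at the level of rounding error is consistent with the second expression being the inverse that appears in the first.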

If $P_n(\hat\theta_n)$ can be built up to reach $\chi^2(M-1)$, it can also be chopped
down to $\chi^2(M-m-1)$. Since $B(B'B)^{-1}B'$ is the orthogonal projection onto the
range of $B$, you should be able to show that $V_n(\hat\theta_n)'D(\hat\theta_n)V_n(\hat\theta_n)$ has the
$\chi^2(M-m-1)$ limiting null distribution, where

    $D(\theta) = I_M - B(\theta)[B'(\theta)B(\theta)]^{-1}B'(\theta)$.

This result does not even depend on the use of $\hat\theta_n$: $\bar\theta_n$ and most other
estimators of $\theta$ give the same result. But the price of such generality is
inefficiency. Simulations suggest that the $D(\hat\theta_n)$ statistic often has low
power (that is, little ability to detect that the null hypothesis is false).
The $C_n$ statistic, on the other hand, is usually more powerful than the
Pearson statistic. It deserves consideration as a standard chi-square test
for goodness of fit.


5.  CONTINGENCY  TABLES 

The  use  of  chi-square  statistics  for  testing  fit  is  based  on  creating 
a set  of  multinomial  observations  by  counting  cell  frequencies.  Because 
only cell frequencies are used in the tests, some information is lost.
There are other classes of tests of fit which are generally more powerful
than chi-square tests, though none so flexible and widely applicable. There
are, however, situations in which multinomial observations arise naturally.
In such cases, chi-square tests are the natural large sample tests. A
common instance is a contingency table: sample units are categorized
according to two or more variables with the intent of discovering the
relationship between the variables. The data consist of the frequencies
of sample units in all possible cross-classifications. Here is the layout
of a 2xs contingency table, with the notation used for the cell frequencies.

    (18)
            $N_{11}$   $N_{12}$   ...   $N_{1s}$   |   $N_{1\cdot}$
            $N_{21}$   $N_{22}$   ...   $N_{2s}$   |   $N_{2\cdot}$
            ---------------------------------------------
            $N_{\cdot 1}$   $N_{\cdot 2}$   ...   $N_{\cdot s}$

We have used the common notation in which a dot replaces an index when
the frequencies are summed over the full range of that index. Thus $N_{\cdot j}$ is
the jth column sum, the total number of units which fell in category j for
the column variable. For simplicity, this 2xs table will be the focus of
our discussion, though the conclusions are generally valid.

What is the proper probability model for these data? The model must
reflect the way in which the data were collected. There are several
different sampling procedures which could lead to the table (18). A single
random sample of size $n$ might be selected, then categorized in two ways.
For example, a random sample of persons being treated for cancer might be
classified by sex (2 categories) and type of cancer (s categories). Call
this Model A. Under Model A, the cell frequencies have a single
2s-nomial distribution. The marginal frequencies are all random, and
satisfy

    (19)  $N_{1\cdot} + N_{2\cdot} = \sum_{j=1}^{s} N_{\cdot j} = \sum_{i=1}^{2}\sum_{j=1}^{s} N_{ij} = n$.

Table (18) might also result from selecting two independent random
samples, of male cancer patients and female cancer patients separately,
then categorizing each patient by type of cancer. Under this Model B,
table (18) contains two independent s-nomial distributions. Although (19)
still holds, $N_{1\cdot}$ and $N_{2\cdot}$ are no longer random, for they are the sample
sizes chosen by the experimenter. Model C reverses the roles of the
variables: choose independent random samples of patients under treatment
for each of s types of cancer, then categorize each by sex. Here there are
s independent binomials, and the $N_{\cdot j}$ are nonrandom sample sizes.


All  three  models  for  table  (18)  are  sets  of  independent  multinomial 
observations.  Chi-square  methods  provide  tests  of  hypotheses  concerning  the 
cell  probabilities  in  any  such  setting.  This  is  a generalization  of  the 
situation  arising  in  tests  of  fit,  where  only  a single  multinomial  sample 
was available, but the theory of chi-square tests follows much the same
line.

Hypotheses for these models are stated in terms of the cell probabilities
$p_{ij}$ for the sampled population. Each model imposes different constraints on
the $p_{ij}$. Model A requires only that

    (20)  $\sum_{i=1}^{2}\sum_{j=1}^{s} p_{ij} = 1$

(and of course that $0 \leq p_{ij} \leq 1$ for all i and j). Model B states that

    (21)  $p_{1\cdot} = \sum_{j=1}^{s} p_{1j} = 1$,   $p_{2\cdot} = \sum_{j=1}^{s} p_{2j} = 1$,

since in this case two independent s-nomials are observed. Model C assumes
instead that $p_{\cdot j} = p_{1j} + p_{2j} = 1$ for each j. The most common hypotheses
(and the only ones we will consider) formalize the statement that there is
no connection between the two categorizations - in the example, no connection
between the sex of a cancer patient and the type of cancer under treatment.
In Model A, this is the hypothesis of independence,

    (22)  $H_A$:  $p_{ij} = p_{i\cdot}\,p_{\cdot j}$,   i = 1, 2 and j = 1,...,s.

In Model B, the hypothesis is that of two identical s-nomial distributions,

    (23)  $H_B$:  $p_{1j} = p_{2j}$,   j = 1,...,s.

For Model C, no connection between categorizations is expressed as the
hypothesis of s identical binomial distributions,

    $H_C$:  $p_{11} = p_{12} = \cdots = p_{1s}$.

In all of this, our concern has been simply to translate the sampling
design and the question to be asked of the data into a mathematical model.
This process is often less clear and more controversial than the theory
which follows from the model selected. There are, for example, yet other
models for the data of table (18). These models assume that the data arise
not as random samples from a large population, but from experimental
randomization of a finite set of experimental units. In the randomization
analog of Model B, $n$ units (say lab rats) are available, of which $N_{1\cdot}$ are
assigned to Treatment 1 and $N_{2\cdot}$ to Treatment 2 by random allocation. The
response of each rat falls into one of s categories, and $N_{ij}$ is the number
of rats receiving Treatment i with response j resulting. Just as in Model B,
(19) holds, $N_{1\cdot}$ and $N_{2\cdot}$ are fixed, and the hypothesis to be tested is that
of equal response distributions. But because the responses of the fixed
set of rats are dependent, the distribution of the $N_{ij}$ is no longer
multinomial. Moreover, under the null hypothesis of "no treatment effect", the
number $N_{\cdot j}$ of rats showing response j is a nonrandom characteristic of the
particular set of rats used in the experiment.

Now it turns out that under Model B the conditional null distribution
of the $N_{ij}$ given the observed values of $N_{\cdot j}$ (j = 1,...,s) is exactly the
null distribution of the $N_{ij}$ under the randomization model just described.
Because such experimental randomization is common practice in many fields
of work, it has been the historical pattern to argue that the randomization
models are "exact" and the multinomial models (and therefore the chi-square
tests) are valid only when interpreted conditionally. But wait - since the
randomization model considers only the fixed set of units actually involved
in the experiment, any inference based on that model can apply only to those
particular units. If our $n$ rats are in some way atypical of rats-in-general,
this will influence the outcome of the experiment, and no conclusions can be
drawn for rats-in-general. In practice, we argue or assume that our
particular units are representative of some larger population. That is, we
commonly act as if we had samples from a population of interest. What is
more, steps are often taken to justify this assumption - we cannot sample
the population of all rats or all cancer patients, but we can select our
experimental units at random from a large pool of accessible units. In such
a case neither the randomization nor the multinomial models are strictly
appropriate, but the multinomial models do represent the conditions which
the selection and allocation of units aim to attain.

This discussion is not at all a digression. It is rather a paradigm
of the features which distinguish areas of applied mathematics (such as
statistics) from mathematics-for-its-own-sake. It is time, however, to
assume that one of the multinomial models adequately describes the data and
to turn to tests of $H_A$, $H_B$ or $H_C$. If we wish to detect any deviation from
the null hypothesis (not just deviations in some specified direction), an
omnibus test is in order. And if the sample sizes are moderately large,
chi-square methods provide such tests.

We will follow the pattern of Section 4, denoting the unknown vector
of parameters by $\theta$. In the contingency table case, $\theta$ consists of a set of
cell probabilities which determine all of the $p_{ij}$ when combined with the
constraints imposed by the model and by the null hypothesis. Under Model B,
for example, we will take $\theta = (p_{11},\ldots,p_{1,s-1})'$, since (21) and (23) then
determine the complete set of cell probabilities. The estimators, test
statistics, and limiting distributions discussed below do not depend on this
particular choice of s-1 cell probabilities for $\theta$, but the dimension $m = s-1$
of $\theta$ is a consequence of the model and the hypothesis.

The probability function in Model B is

    $\frac{N_{1\cdot}!}{N_{11}!\cdots N_{1s}!}\prod_{j=1}^{s} p_{1j}^{N_{1j}} \;\cdot\; \frac{N_{2\cdot}!}{N_{21}!\cdots N_{2s}!}\prod_{j=1}^{s} p_{2j}^{N_{2j}}$.

To estimate $\theta$ by the maximum likelihood method, express each $p_{ij}$ as $p_{ij}(\theta)$
and the probability function for given $N_{ij}$ as a function $L(\theta)$ of $\theta$. Then
solving

    $\frac{\partial \log L(\theta)}{\partial\theta_k} = \frac{N_{1k}}{p_{1k}} - \frac{N_{1s}}{p_{1s}} + \frac{N_{2k}}{p_{1k}} - \frac{N_{2s}}{p_{1s}} = 0$,   k = 1,...,s-1

(recall $\theta_k = p_{1k}$) produces the MLE

    $\hat{p}_{1k} = \frac{N_{1k} + N_{2k}}{n} = \frac{N_{\cdot k}}{n}$,   k = 1,...,s-1.

This is the "obvious" estimator of $p_{1k}$ under $H_B$, namely the overall relative
frequency of the kth response in the two samples. Another way to describe
$\hat{p}_{1k}$ is as the weighted arithmetic mean of the relative frequencies $N_{1k}/N_{1\cdot}$
and $N_{2k}/N_{2\cdot}$ of the kth response in the separate samples. The Pearson
chi-square statistic for two independent s-nomials, with $\theta$ replaced by its MLE, is

    $\sum_{j=1}^{s} \frac{[N_{1j} - N_{1\cdot}\,p_{1j}(\theta)]^2}{N_{1\cdot}\,p_{1j}(\theta)} + \sum_{j=1}^{s} \frac{[N_{2j} - N_{2\cdot}\,p_{2j}(\theta)]^2}{N_{2\cdot}\,p_{2j}(\theta)} = \sum_{i=1}^{2}\sum_{j=1}^{s} \frac{[N_{ij} - N_{i\cdot}N_{\cdot j}/n]^2}{N_{i\cdot}N_{\cdot j}/n}$.

When the $p_{ij}$ are known, this statistic has (s-1) + (s-1) = 2s - 2 degrees
of freedom. Here, m = s-1 parameters were estimated by the multinomial MLE
method, so since (2s-2) - (s-1) = s-1, the limiting null distribution is
$\chi^2(s-1)$.

If the data of table (18) arose from a single random sample (Model A),
the probability function is

$$n!\,\prod_{i=1}^{2}\prod_{j=1}^{s}\frac{p_{ij}^{N_{ij}}}{N_{ij}!}.$$


The unknown parameter can be taken to be $\theta = (p_{1\cdot}, p_{\cdot 1},\ldots,p_{\cdot,s-1})'$ since (20)
and (22) then determine the full set of cell probabilities. Once again
other choices of $\theta$ are possible but all have $m = s$. Computing the MLE of
$\theta$ gives the natural estimator for Model A under $H_A$, namely

$$\hat p_{1\cdot} = \frac{N_{1\cdot}}{n}, \qquad \hat p_{\cdot k} = \frac{N_{\cdot k}}{n}.$$

(Compare (22) to see why this is the natural estimator.) The Pearson
chi-square statistic for this single $2s$-nomial model is

$$\sum_{i=1}^{2}\sum_{j=1}^{s}\frac{[N_{ij} - n\,p_{ij}(\hat\theta)]^2}{n\,p_{ij}(\hat\theta)} = \sum_{i=1}^{2}\sum_{j=1}^{s}\frac{[N_{ij} - N_{i\cdot}N_{\cdot j}/n]^2}{N_{i\cdot}N_{\cdot j}/n}$$

and has $2s-m-1 = s-1$ degrees of freedom.
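
As a small numerical check, the following Python sketch computes the Pearson statistic for a 2 x s table in two ways: from pooled relative frequencies, as for two independent s-nomials, and from the product of the marginal relative frequencies, as for a single 2s-nomial. The function names and the example counts are illustrative assumptions and do not come from the original text.

```python
def pearson_model_b(table):
    """Pearson statistic for two independent s-nomials under homogeneity.

    Expected counts are N_i. * p_hat_j, where p_hat_j is the pooled
    relative frequency (N_1j + N_2j) / n.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for row in table:
        row_total = sum(row)
        for count, col_total in zip(row, col_totals):
            expected = row_total * (col_total / n)
            stat += (count - expected) ** 2 / expected
    return stat


def pearson_model_a(table):
    """Pearson statistic for a single 2s-nomial under independence.

    Expected counts are n * (N_i. / n) * (N_.j / n) = N_i. N_.j / n.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = n * (row_totals[i] / n) * (col_totals[j] / n)
            stat += (count - expected) ** 2 / expected
    return stat


# Hypothetical 2 x 3 table of counts (rows: samples, columns: responses).
example = [[10, 20, 30], [20, 20, 20]]
degrees_of_freedom = len(example[0]) - 1  # s - 1
```

Both functions return the same value for any table with positive margins, to be referred to the chi-square table with s-1 degrees of freedom.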


Look closely. The Pearson chi-square statistics for testing $H_A$ in
Model A and testing $H_B$ in Model B are identical, and have the same
$\chi^2(s-1)$ limiting null distribution. And of course the same statistic results
from testing in Model C. This serendipitous outcome depends very much on
the fact that maximum likelihood estimation was used. As Section 4 proclaimed,
asymptotically equivalent statistics can be obtained by using either the
minimum chi-square or the minimum modified chi-square method to estimate
$\theta$. But only asymptotically equivalent. Let us apply the minimum modified
chi-square method to Model B. We must choose $\theta = (p_{11},\ldots,p_{1,s-1})'$ to
minimize the modified chi-square statistic

$$\sum_{j=1}^{s}\frac{[N_{1j} - N_{1\cdot}\,p_{1j}(\theta)]^2}{N_{1j}} + \sum_{j=1}^{s}\frac{[N_{2j} - N_{2\cdot}\,p_{2j}(\theta)]^2}{N_{2j}}.$$

Differentiation followed by a short, ugly calculation shows that the
minimum modified chi-square estimator of $p_{1k}$ is proportional to the weighted
harmonic mean of the relative frequencies $N_{1k}/N_{1\cdot}$ and $N_{2k}/N_{2\cdot}$ for the $k$th
response. While not entirely outrageous, this is surely less appealing
than the MLE result. In Model A, the situation is worse: the equations
arising from differentiating the modified chi-square statistic are nonlinear,
and have no closed-form solution. The Pearson statistics for Models A and
B when minimum modified chi-square estimators are used are not identical.

The minimum chi-square estimators have no explicit expressions in either
model, and again the chi-square statistics differ. No wonder the MLE is
always used for contingency tables.
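
The harmonic-mean claim for Model B can be checked numerically. The sketch below is an illustration under stated assumptions: it assumes every cell count is positive, and it uses the closed form obtained by setting the derivative of the modified chi-square statistic (with a Lagrange multiplier for the constraint that the probabilities sum to one) equal to zero. The function names and the example table are hypothetical.

```python
def mle_model_b(table):
    """Pooled relative frequencies (N_1k + N_2k) / n."""
    n = sum(sum(row) for row in table)
    return [sum(col) / n for col in zip(*table)]


def min_modified_chisq_model_b(table):
    """Minimizer of sum_i sum_j (N_ij - N_i. p_j)^2 / N_ij over sum_j p_j = 1.

    The stationarity condition gives
        p_k proportional to 1 / (N_1.^2 / N_1k + N_2.^2 / N_2k),
    which is proportional to the weighted harmonic mean of N_1k / N_1.
    and N_2k / N_2. with weights N_1. and N_2. .
    Requires every cell count to be positive.
    """
    row_totals = [sum(row) for row in table]
    weights = []
    for col in zip(*table):
        denom = sum(r * r / c for r, c in zip(row_totals, col))
        weights.append(1.0 / denom)
    total = sum(weights)
    return [w / total for w in weights]


def modified_chisq(table, p):
    """Modified chi-square statistic: observed counts in the denominators."""
    return sum(
        (count - sum(row) * pj) ** 2 / count
        for row in table
        for count, pj in zip(row, p)
    )


# Hypothetical 2 x 3 table of counts.
example = [[10, 20, 30], [20, 20, 20]]
p_mle = mle_model_b(example)
p_mod = min_modified_chisq_model_b(example)
```

For this table the two estimators differ, and the modified chi-square statistic is smaller at `p_mod` than at `p_mle`, as it must be.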

There is a pattern to the use of these latter estimation methods in
hypothesis testing for independent multinomial observations. The minimum
modified chi-square method produces a set of linear equations to be solved
for the estimated parameters whenever the hypothesis is linear in the cell
probabilities. This was true of $H_B$ but not of $H_A$. In many situations it is
easier to compute minimum modified chi-square estimators than the MLE's.

Minimum chi-square estimators, on the other hand, can rarely be obtained in
closed form and are seldom used.

6.  A FURTHER  RANGE 

This  survey  of  chi-square  tests  has  entirely  ignored  several  areas  of 
considerable  interest  to  users  of  these  tests.  Computers  make  it  feasible 
to  obtain  the  exact  distributions  of  the  test  statistics  in  small  samples, 
both  for  use  and  for  assessment  of  the  accuracy  of  the  chi-square  approximations. 

The  relative  power  of  the  tests  can  be  studied  either  by  calculation  and 
simulation  or,  in  large  samples,  by  various  mathematical  devices.  I have 
chosen  to  restrict  this  essay  to  the  study  of  large  sample  distribution 
theory  under  the  null  hypothesis.  Even  here  there  is  a further  range  of 

theory  which  both  opens  up  new  possibilities  for  the  user  and  illustrates 

the  use  of  increasingly  sophisticated  mathematics  in  statistical  theory. 

We  have  been  assuming  almost  without  reflection  that  the  number  of 

cells  M in  a chi-square  test  of  fit  remains  fixed  as  the  sample  size 

increases,  and  that  the  cells  are  fixed  without  regard  to  the  data. 

Neither  assumption  is  necessarily  realistic  as  a description  of  statistical 

practice.  It  is  common  to  use  more  cells  when  blessed  with  a larger  sample, 

and  equally  common  (though  less  publicly  admitted)  to  move  the  cells  to  the 

data.  What  are  the  consequences  of  incorporating  these  innovations  in  the 

chi-square  statistics  of  Section  4? 

Increasing the number of cells $M$ as the sample size $n$ increases has
radical consequences. When $M$ grows with $n$, the Pearson statistic for testing
fit to a completely specified distribution has a normal, not a chi-square,
limiting null distribution when properly standardized. This is in accord
with intuition, since the $\chi^2(M-1)$ limiting distribution of the Pearson
statistic approaches normality as $M \to \infty$. (Apply the central limit theorem
to $\sum_{j=1}^{M-1} Z_j^2$.) One expects that the limiting null distribution when parameters
are estimated will also be normal, with a different standardization perhaps
required. No proof of this has been given. The lack of attention to this
problem may be due in part to simulations suggesting that replacing $M$ by
$M(n) \to \infty$ produces tests which compare favorably with fixed-$M$ chi-square tests
only against very short-tailed alternatives.
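
The approach to normality can be seen in a rough simulation. The sketch below draws chi-square variables as sums of squared standard normals, standardizes them by their mean and standard deviation, and compares sample skewness for a small and a large number of degrees of freedom. The sample sizes, seed, and function names are arbitrary choices made for illustration.

```python
import math
import random


def standardized_chisq(df, rng):
    """One draw of (chi-square(df) - df) / sqrt(2 df), as a sum of squared normals."""
    total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return (total - df) / math.sqrt(2.0 * df)


def sample_skewness(values):
    """Moment estimate of skewness; the normal distribution has skewness 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
skew_small = sample_skewness([standardized_chisq(4, rng) for _ in range(2000)])
skew_large = sample_skewness([standardized_chisq(200, rng) for _ in range(2000)])
# The theoretical skewness of chi-square(df) is sqrt(8 / df), which tends to 0.
```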

The second innovation, use of data-dependent cells, has been better
studied and is finding its way increasingly into practical use. Suppose then
that the cells $E_i$ are replaced by

$$E_{in}(X_1,\ldots,X_n), \qquad i = 1,\ldots,M$$

in the general chi-square statistic (6) of Section 4. For simplicity we
consider only univariate observations $X_j$ and cells $E_{in} = (a_{i-1,n}, a_{in}]$ which
are intervals with endpoints $a_{in} = a_{in}(X_1,\ldots,X_n)$. It is only reasonable
to demand that the random cells settle down as the sample size increases:

$$a_{in} \to a_{i0} = a_{i0}(\theta_0) \quad (P)$$

under $G(\cdot\mid\theta_0)$.


An example of useful data-dependent cell boundaries is $a_{in} = \bar X_n + c_i s_n$ in
testing fit to the univariate normal family. The sample mean $\bar X_n$ moves the
cells to the data, and the sample standard deviation $s_n$ adjusts the cell
widths to the dispersion of the data. Here $a_{i0}(\mu,\sigma) = \mu + c_i\sigma$ are the
limiting cell boundaries.
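
A minimal Python sketch of such cells follows. It makes two assumptions that the text leaves open: the constants are taken to be the standard normal quantiles of order i/M, and the sample standard deviation uses the n-1 divisor. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_random_cells(sample, num_cells):
    """Boundaries a_in = mean + c_i * s, with c_i the standard normal i/M quantile."""
    xbar, s = mean(sample), stdev(sample)  # stdev uses the n - 1 divisor
    std_normal = NormalDist()
    interior = [
        xbar + std_normal.inv_cdf(i / num_cells) * s
        for i in range(1, num_cells)
    ]
    return [-math.inf] + interior + [math.inf]


def cell_counts(sample, boundaries):
    """Number of observations in each cell (a_{i-1}, a_i]."""
    counts = [0] * (len(boundaries) - 1)
    for x in sample:
        for i in range(len(counts)):
            if boundaries[i] < x <= boundaries[i + 1]:
                counts[i] += 1
                break
    return counts
```

With this choice of constants the estimated cell probabilities are all 1/M, whatever the sample.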

If $a_n$ denotes the vector of cell boundaries ($a_{0n} = -\infty$, $a_{Mn} = +\infty$), then
the "cell probabilities" under the null hypothesis are now

$$p_i(\theta,a_n) = \int_{a_{i-1,n}}^{a_{in}} dG(x\mid\theta) = G(a_{in}\mid\theta) - G(a_{i-1,n}\mid\theta).$$

The $M$-vector of standardized cell frequencies becomes $V_n(\theta,a_n)$ with $i$th
component

$$\frac{N_i(a_n) - n\,p_i(\theta,a_n)}{[n\,p_i(\theta,a_n)]^{1/2}}$$

where $N_i(a_n)$ is the number of $X_1,\ldots,X_n$ in $E_{in}$. But the cell frequencies
$N_i(a_n)$ are no longer multinomial, since the cell boundaries are dependent
on the observations $X_j$ being counted. The central mathematical hurdle of
establishing asymptotic normality of $V_n(\theta_n,a_n)$ for estimators $\theta_n$ and random
cell boundaries $a_n$ is now much more difficult. Fortunately, there is an
elegant modern technique leading to a pleasing result.

The pleasing result is that the asymptotic distribution of $V_n(\theta_n,a_n)$
under $G(\cdot\mid\theta_0)$ is the same as that of $V_n(\theta_n,a_0)$. That is, the asymptotic
behavior of any random-cell chi-square statistic is exactly that of the same
statistic using the limiting cell boundaries $a_0$. Speaking roughly, the use
of data-dependent cells has no effect. The naive user who moves his cells
to the data is safe. Actually, the dependence of the limiting cells on the
(unknown) true $\theta_0$ complicates these rough conclusions slightly. But any of
the statistics in Section 4 which has a $\theta_0$-free limiting null distribution
using fixed cells has that same distribution using any set of converging
random cells.

What of the promised elegant modern technique? Since $V_n(\theta_n,a_0)$ is just
our old acquaintance $V_n(\theta_n)$ for a particular set of fixed cells, let us use
the latter notation. We must show that under $G(\cdot\mid\theta_0)$

$$V_n(\theta_n,a_n) - V_n(\theta_n) = o_p(1).$$

Since both $p_i(\theta_n,a_n)$ and $p_i(\theta_n)$ converge in probability to $p_i(\theta_0)$ whenever $\theta_n$
is a consistent sequence of estimators and $p_i$ is continuous, their presence
in the denominators can be ignored. We need only prove that the expression

$$n^{-1/2}[N_i(a_n) - n\,p_i(\theta_n,a_n)] - n^{-1/2}[N_i - n\,p_i(\theta_n)] \tag{24}$$

is $o_p(1)$.

Introduce the empiric distribution function

$$G_n(t) = n^{-1}\{\text{Number of } X_1,\ldots,X_n : X_j \le t\}.$$

This is the natural estimator of the common df of the $X_j$. It increases
from 0 to 1 in jumps of $1/n$ at each observation. At any fixed $t$, $G_n(t)$ is
a multiple of a binomial random variable with success probability $G(t\mid\theta_0)$
when this is the true df of the $X_j$. So the empiric distribution function
process

$$W_n(t) = n^{1/2}\{G_n(t) - G(t\mid\theta_0)\}$$

has a normal limiting distribution for each fixed $t$. Since

$$n^{-1/2}N_i(a_n) = n^{1/2}\{G_n(a_{in}) - G_n(a_{i-1,n})\},$$

$$p_i(\theta,a_n) = G(a_{in}\mid\theta) - G(a_{i-1,n}\mid\theta),$$

there is some hope of expressing (24) in terms of $W_n$ and using the
convergence properties of that process to achieve our goal.
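
These definitions translate directly into code. The sketch below is illustrative only: it uses the exponential df $G(t\mid\theta) = 1 - e^{-t/\theta}$ as a stand-in for the hypothesized df (an assumption made so that the example is concrete), and the function names are hypothetical.

```python
import math


def empirical_df(sample, t):
    """G_n(t): fraction of observations X_j with X_j <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)


def edf_process(sample, t, theta0):
    """W_n(t) = sqrt(n) * (G_n(t) - G(t | theta0)).

    G(t | theta) = 1 - exp(-t / theta) is used purely as an example of a
    hypothesized df; any other df could be substituted.
    """
    n = len(sample)
    hypothesized = 1.0 - math.exp(-t / theta0) if t > 0 else 0.0
    return math.sqrt(n) * (empirical_df(sample, t) - hypothesized)


def scaled_cell_count(sample, lower, upper):
    """n^{-1/2} N_i for the cell (lower, upper], written in terms of G_n."""
    n = len(sample)
    return math.sqrt(n) * (empirical_df(sample, upper) - empirical_df(sample, lower))
```

For the four observations 0.5, 1.0, 1.5, 2.0, the cell (0.5, 1.5] holds two of them, and `scaled_cell_count` returns 2 divided by the square root of 4.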

In Section 3 we saw that convergence in distribution for random variables
amounted to describing a random variable by a probability measure on $R^k$
and defining convergence as convergence of the measures $F_n(A)$ of all Borel
sets $A$ having boundaries with measure 0 under the limiting distribution.

This development extends at once to more general spaces than $R^k$. Indeed,
such an extension is the primary topic of Billingsley's book [1] which was
cited in Section 3. Now a stochastic process such as $W_n$ is a random function,
a function of the real variable $t$ which varies with the underlying probability
mechanism generating the $X_j$. Just as a random variable can be identified
with a probability measure on $R^k$, so a random function can be identified with
a probability measure on a suitable function space. Convergence in distribution
for processes then has the same definition as convergence in distribution for
random variables. This viewpoint, adopted from functional analysis, has
become a standard tool of statistical large sample theory. Billingsley's
book is a basic exposition, restricted to metric spaces, Borel $\sigma$-fields, and
random functions of a single real variable.

It turns out that $W_n(t)$ does converge in distribution to a process
$W_0(t)$ which is a variation of Brownian motion, one of the most familiar
stochastic processes. This is an analog of the central limit theorem. Of
this powerful result we need only two details: $W_0(t)$ is continuous with
probability 1, and the function space on which $W_n$ and $W_0$ are probability
measures has the property that convergence to a continuous limit function
is always uniform.


The machinery to crush (24) is now assembled. Arithmetic shows that
(24) is

$$\{W_n(a_{in}) - W_n(a_{i0})\} - \{W_n(a_{i-1,n}) - W_n(a_{i-1,0})\} + n^{1/2}\{p_i(\theta_0,a_n) - p_i(\theta_n,a_n)\} - n^{1/2}\{p_i(\theta_0,a_0) - p_i(\theta_n,a_0)\}. \tag{25}$$

Applying the mean value theorem to the last two terms in (25) gives

$$-\left[\frac{\partial p_i}{\partial\theta}(\theta_n^{*},a_n) - \frac{\partial p_i}{\partial\theta}(\theta_n^{**},a_0)\right]' n^{1/2}(\theta_n - \theta_0)$$

for $\theta_n^{*}$, $\theta_n^{**}$ between $\theta_n$ and $\theta_0$. This is $o_p(1)$ whenever the $m$-vector of
derivatives $\partial p_i/\partial\theta$ is continuous and $n^{1/2}(\theta_n - \theta_0)$ converges in distribution.
The first two terms in (25) have the form

$$W_n(c_n) - W_n(c)$$

where $c_n \to c$ (P). Now the two general facts of Section 3 apply to convergence
in distribution of processes as well. So

$$(W_n, c_n) \xrightarrow{\mathcal{L}} (W_0, c)$$

by the second general fact. The function

$$\varphi(f,t) = f(t) - f(c)$$

is continuous with probability 1 with respect to the distribution of $(W_0,c)$.
(Check that: if $f_n \to f$ uniformly and $t_n \to c$, then $\varphi(f_n,t_n) \to \varphi(f,c) = 0$.
That implies continuity with probability 1 because $W_0(t)$ is continuous, and
convergence to a continuous limit function is uniform in the function space
at hand.) So by the continuity theorem

$$W_n(c_n) - W_n(c) = \varphi(W_n,c_n) \to \varphi(W_0,c) = 0.$$

Convergence in distribution to a constant is convergence in probability, so
we have shown that (25) is $o_p(1)$.

This  essentially  simple  argument  can  be  generalized  to  multivariate 
observations  and  alternative  hypotheses.  A full  treatment  appears  in  [4].
The  examples  in  the  next  section  illustrate  the  advantages  of  being  free  to 
use  data-dependent  cells. 


7.  SOME  EXAMPLES 

In this section the results of Sections 4 and 6 will be applied to
produce several chi-square tests of fit to the family of exponential densities

$$g(x\mid\theta) = \frac{1}{\theta}\,e^{-x/\theta}, \qquad 0 < x < \infty, \tag{26}$$

$$\Omega = \{\theta : 0 < \theta < \infty\}.$$

There are many tests of fit for so standard a family which exceed the
chi-square tests in power. Chi-square tests of fit have their greatest
potential usefulness in situations where other tests of fit cannot be used
(discrete or multivariate data), or where the work of computing critical
points for a non-tabled distribution is not justified. But restricting
ourselves to a single, simple hypothesized family has obvious expository
advantages.

Suppose then that $X_1,\ldots,X_n$ are independent random variables having a
common unknown distribution which is hypothesized to belong to the family
(26).

The Pearson statistic. Choose $M$ fixed cells $E_i = (a_{i-1}, a_i]$ partitioning
$(0,\infty)$. The cell probabilities under the null hypothesis are

$$p_i(\theta) = \int_{a_{i-1}}^{a_i}\frac{1}{\theta}\,e^{-x/\theta}\,dx = e^{-a_{i-1}/\theta} - e^{-a_i/\theta}.$$

We will estimate $\theta$ by the grouped data MLE $\bar\theta_n$. The equation resulting from
differentiating the logarithm of the multinomial probability function of
$N_1,\ldots,N_M$ with respect to $\theta$ is

$$\sum_{i=1}^{M} N_i\,\frac{a_{i-1}e^{-a_{i-1}/\theta} - a_i e^{-a_i/\theta}}{e^{-a_{i-1}/\theta} - e^{-a_i/\theta}} = 0. \tag{27}$$

This has no closed-form solution, but is easily solved iteratively to obtain
$\bar\theta_n$. Substituting this numerical value into the Pearson statistic

$$\sum_{i=1}^{M}\frac{[N_i - n\,p_i(\bar\theta_n)]^2}{n\,p_i(\bar\theta_n)}$$

gives a test statistic with approximately the $\chi^2(M-2)$ null distribution.
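
One way to carry out the iteration is bisection on the left side of (27). The Python sketch below is an illustration, not a prescribed algorithm: it assumes the caller supplies a bracket on which the left side of (27) changes sign from positive to negative, it treats $a e^{-a/\theta}$ as 0 at $a = \infty$, and all function names are hypothetical.

```python
import math


def cell_probs(theta, boundaries):
    """p_i(theta) = exp(-a_{i-1}/theta) - exp(-a_i/theta) for exponential cells."""
    surv = [0.0 if math.isinf(a) else math.exp(-a / theta) for a in boundaries]
    return [surv[i] - surv[i + 1] for i in range(len(boundaries) - 1)]


def score(theta, counts, boundaries):
    """Left side of (27); a * exp(-a/theta) is taken as 0 at a = infinity."""
    g = [0.0 if math.isinf(a) else a * math.exp(-a / theta) for a in boundaries]
    p = cell_probs(theta, boundaries)
    return sum(n_i * (g[i] - g[i + 1]) / p[i] for i, n_i in enumerate(counts))


def grouped_mle(counts, boundaries, lo, hi, tol=1e-10):
    """Solve (27) by bisection, assuming score(lo) > 0 > score(hi)."""
    if not (score(lo, counts, boundaries) > 0 > score(hi, counts, boundaries)):
        raise ValueError("bracket [lo, hi] does not straddle a root of (27)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, counts, boundaries) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pearson_statistic(counts, boundaries, theta):
    """Pearson statistic with cell probabilities evaluated at theta."""
    n = sum(counts)
    p = cell_probs(theta, boundaries)
    return sum((c - n * pi) ** 2 / (n * pi) for c, pi in zip(counts, p))


# Hypothetical fixed cells and counts for M = 4.
boundaries = [0.0, 0.5, 1.0, 2.0, math.inf]
counts = [42, 25, 20, 13]
theta_bar = grouped_mle(counts, boundaries, lo=0.05, hi=50.0)
statistic = pearson_statistic(counts, boundaries, theta_bar)  # refer to chi-square(M - 2)
```

Bisection is chosen here only for its robustness; Newton's method or any other root finder would serve equally well.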

Using the raw data MLE. The maximum likelihood estimator of $\theta$ from
$X_1,\ldots,X_n$ is the sample mean, $\hat\theta_n = \bar X_n$. Who would wish to solve (27) when
$\bar X_n$ will do our estimating? If $\hat\theta_n$ is used in the Pearson statistic, a
$\theta$-dependent limiting null distribution results. But a nice feature of
random cells now appears: in testing fit to location-parameter and
scale-parameter families, random cells can eliminate the $\theta$-dependence of the null
distribution. In this case, we use cells

$$E_i(\bar X) = (c_{i-1}\bar X,\; c_i\bar X]$$

where the $c_i$ are constants. In the notation of Section 6, $a_{in} = c_i\bar X$ and

$$p_i(\theta,a_n) = \int_{c_{i-1}\bar X}^{c_i\bar X}\frac{1}{\theta}\,e^{-x/\theta}\,dx = e^{-c_{i-1}\bar X/\theta} - e^{-c_i\bar X/\theta}. \tag{28}$$

The estimated cell probabilities $p_i(\bar X,a_n) = e^{-c_{i-1}} - e^{-c_i}$ do not depend on
the sample! The Pearson statistic for random cells of this form is
algebraically unchanged by the transformation $X_j \to X_j/\theta$ because the cell
boundaries move in such a way as to keep the cell frequencies as well as
the estimated cell probabilities fixed. Since $X_j/\theta$ has the $g(\cdot\mid 1)$ density
function when $X_j$ has the $g(\cdot\mid\theta)$ density function, the distribution of the
statistic does not depend on $\theta$. If in particular $c_i = -\log(1 - i/M)$, we
obtain $M$ equiprobable cells, $p_i = 1/M$.
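
The scale invariance is easy to confirm numerically. In the sketch below (function names hypothetical) the sample is rescaled by a factor of 4, chosen as a power of two so that floating-point rounding cannot disturb the comparison, and the cell counts are unchanged.

```python
import math
import random


def equiprobable_constants(num_cells):
    """c_i = -log(1 - i/M) for i = 0, ..., M, with c_M = infinity."""
    return [-math.log(1.0 - i / num_cells) for i in range(num_cells)] + [math.inf]


def random_cell_counts(sample, constants):
    """Counts of observations in the random cells (c_{i-1} * mean, c_i * mean]."""
    xbar = sum(sample) / len(sample)
    counts = [0] * (len(constants) - 1)
    for x in sample:
        for i in range(len(counts)):
            if constants[i] * xbar < x <= constants[i + 1] * xbar:
                counts[i] += 1
                break
    return counts


def estimated_cell_probs(constants):
    """p_i(mean, a_n) = exp(-c_{i-1}) - exp(-c_i); free of the data."""
    surv = [math.exp(-c) for c in constants]  # exp(-inf) == 0.0
    return [surv[i] - surv[i + 1] for i in range(len(constants) - 1)]


rng = random.Random(1)  # fixed seed for reproducibility
sample = [rng.expovariate(1.0) for _ in range(100)]
rescaled = [x / 4.0 for x in sample]  # the transformation X_j -> X_j / theta
c = equiprobable_constants(4)
counts_original = random_cell_counts(sample, c)
counts_rescaled = random_cell_counts(rescaled, c)
```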

The use of random cells thus produces a $\theta$-free null distribution. Not
only that, the choice of equiprobable cells simplifies the computation of the
statistic and has been shown to have good power properties when fit to a
single distribution is being tested. But the limiting distribution is not
chi-square. It is the distribution of

$$\chi^2(M-2) + \lambda Z^2,$$

and requires a special computation to obtain critical points even though $\lambda$ does
not depend on $\theta$.

Using a different quadratic form. Both of the statistics we have thus
far applied to testing fit to the family (26) have disabilities in ease of
use. The statistic of Rao and Robson can be explicitly computed, has a
$\chi^2(M-1)$ limiting distribution, and also appears from simulations to be more
powerful than its two competitors. Let us find its form for this problem.

The general forms of both the $C_n$ and $D(\theta_n)$ statistics from Section 4 simplify
because $\sum_{i=1}^{M}\partial p_i/\partial\theta_j = 0$ implies that

$$V_n' B = n^{-1/2}\left(\sum_{i=1}^{M}\frac{N_i}{p_i}\frac{\partial p_i}{\partial\theta_1},\;\ldots,\;\sum_{i=1}^{M}\frac{N_i}{p_i}\frac{\partial p_i}{\partial\theta_m}\right).$$

When $m = 1$, the Rao-Robson statistic reduces to

$$R_n = \sum_{i=1}^{M}\frac{(N_i - n p_i)^2}{n p_i} + \frac{1}{nD}\left(\sum_{i=1}^{M}\frac{N_i}{p_i}\frac{dp_i}{d\theta}\right)^2$$

where

$$D = J - \sum_{i=1}^{M}\frac{1}{p_i}\left(\frac{dp_i}{d\theta}\right)^2$$

and $J$, $p_i$, $dp_i/d\theta$ are all evaluated at $\theta = \hat\theta_n$.
ri  ri  n 

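The use of equiprobable random cells continues to have advantages in
simplicity and (probably) in power, so the cells $(c_{i-1}\bar X, c_i\bar X]$ for
$c_i = -\log(1 - i/M)$ will again be employed. From (28),

$$\frac{dp_i}{d\theta}\bigg|_{\theta=\bar X} = \left[\frac{c_{i-1}\bar X}{\theta^2}\,e^{-c_{i-1}\bar X/\theta} - \frac{c_i\bar X}{\theta^2}\,e^{-c_i\bar X/\theta}\right]_{\theta=\bar X} = \frac{1}{\bar X}\left(c_{i-1}e^{-c_{i-1}} - c_i e^{-c_i}\right)$$

and a short calculation (using $J = 1/\theta^2$ for the family (26)) shows that the
test statistic is

$$R_n = \frac{M}{n}\sum_{i=1}^{M}\left(N_i - \frac{n}{M}\right)^2 + \frac{M^2\left(\sum_{i=1}^{M} N_i d_i\right)^2}{n\left(1 - M\sum_{i=1}^{M} d_i^2\right)}.$$

Here $d_i = c_{i-1}e^{-c_{i-1}} - c_i e^{-c_i}$ and $N_i$ is the number of $X_1,\ldots,X_n$ in
$(c_{i-1}\bar X, c_i\bar X]$. This is the recommended chi-square statistic for this
problem.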
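
A Python sketch of this computation follows. It evaluates the statistic twice: once from the general m = 1 form (Pearson statistic plus correction term), and once from a closed form obtained by substituting $p_i = 1/M$ and $dp_i/d\theta = d_i/\bar X$. The sketch assumes the Fisher information $J(\theta) = 1/\theta^2$ of the exponential family and takes $c e^{-c} = 0$ at $c = \infty$; the closed form in the code is this sketch's own derivation, and the function name is hypothetical.

```python
import math


def rao_robson_exponential(sample, num_cells):
    """Rao-Robson statistic for the exponential family with random cells.

    Cells are (c_{i-1} * mean, c_i * mean] with c_i = -log(1 - i/M), and all
    quantities are evaluated at theta = sample mean.
    Returns the value from the general m = 1 form and from the closed form.
    """
    n, m_cells = len(sample), num_cells
    xbar = sum(sample) / n
    c = [-math.log(1.0 - i / m_cells) for i in range(m_cells)] + [math.inf]

    counts = [0] * m_cells
    for x in sample:
        for i in range(m_cells):
            if c[i] * xbar < x <= c[i + 1] * xbar:
                counts[i] += 1
                break

    # d_i = c_{i-1} e^{-c_{i-1}} - c_i e^{-c_i}, with c e^{-c} = 0 at c = infinity.
    g = [0.0 if math.isinf(ci) else ci * math.exp(-ci) for ci in c]
    d = [g[i] - g[i + 1] for i in range(m_cells)]

    p = 1.0 / m_cells             # equiprobable estimated cell probabilities
    dp = [di / xbar for di in d]  # dp_i/dtheta at theta = mean
    fisher = 1.0 / xbar ** 2      # J(theta) = 1/theta^2 (assumed)
    big_d = fisher - sum(v * v / p for v in dp)

    pearson = sum((k - n * p) ** 2 / (n * p) for k in counts)
    correction = sum(k / p * v for k, v in zip(counts, dp)) ** 2 / (n * big_d)
    general = pearson + correction

    # Closed form after substituting p_i = 1/M and dp_i/dtheta = d_i / mean.
    closed = (m_cells / n) * sum((k - n / m_cells) ** 2 for k in counts) + (
        m_cells ** 2 * sum(k * di for k, di in zip(counts, d)) ** 2
    ) / (n * (1.0 - m_cells * sum(di * di for di in d)))
    return general, closed
```

The two returned values should agree up to rounding; either is referred to the chi-square table with M-1 degrees of freedom.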


Censored data. It is quite common in experiments on reliability or
survival time not to wait for all the lightbulbs to burn out or all the
drugged rats to die. The lifetimes are observed in order, so let us denote
the ordered observations by

$$X_{(1)} < X_{(2)} < \cdots < X_{(n)}$$

and suppose that observations are stopped after the sample $\alpha$-quantile is
observed for some $0 < \alpha < 1$. This is the order statistic $X_{([n\alpha])}$ where
$[n\alpha]$ is the greatest integer in $n\alpha$. The exponential distributions form
a very common model in life-testing, so it is useful to test fit to this
family given only the censored data

$$X_{(1)} < X_{(2)} < \cdots < X_{([n\alpha])}.$$


Here is a challenge to the great flexibility of chi-square methods.
The response is to use random cells with boundaries given by sample
$\delta_i$-quantiles $\xi_i = X_{([n\delta_i])}$ for

$$0 = \delta_0 < \delta_1 < \cdots < \delta_{M-1} = \alpha < \delta_M = 1,$$

so that the $n - [n\alpha]$ unobserved lifetimes fall in the rightmost cell. Of
course, $\xi_0 = 0$ and $\xi_M = \infty$. These random cells fit the demands of Section 6,
for the sample quantile $\xi_i$ converges in probability to the population
$\delta_i$-quantile $x_i(\theta)$ as $n \to \infty$. The population quantiles are found from

$$\int_0^{x_i}\frac{1}{\theta}\,e^{-x/\theta}\,dx = \delta_i.$$

This choice of cells produces nonrandom cell frequencies

$$N_i = [n\delta_i] - [n\delta_{i-1}],$$

but the theory of Section 6 is entirely undisturbed by this rather odd
happenstance. We may cheerfully compute a variety of chi-square statistics
for these cells, but we will content ourselves with the Pearson statistic.

The grouped data MLE is computed exactly as in (27), "ignoring" the fact that
the cell boundaries are random. That is, find $\bar\theta_n = \bar\theta_n(\xi_1,\ldots,\xi_{M-1})$ numerically
by solving

$$\sum_{i=1}^{M} N_i\,\frac{\xi_{i-1}e^{-\xi_{i-1}/\theta} - \xi_i e^{-\xi_i/\theta}}{e^{-\xi_{i-1}/\theta} - e^{-\xi_i/\theta}} = 0,$$

then use the statistic

$$\sum_{i=1}^{M}\frac{[N_i - n\,p_i(\bar\theta_n)]^2}{n\,p_i(\bar\theta_n)}$$

with $p_i(\theta) = e^{-\xi_{i-1}/\theta} - e^{-\xi_i/\theta}$ and critical points from the $\chi^2(M-2)$
table. Hats off to the amazing chi-square statistics.
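
The whole censored-data recipe fits in a short Python sketch. As before, this is an illustration under stated assumptions rather than a definitive implementation: the likelihood equation is solved by bisection on a caller-supplied bracket, $\xi e^{-\xi/\theta}$ is taken as 0 at $\xi = \infty$, and the function name and argument layout are hypothetical.

```python
import math


def censored_pearson_test(observed, n, deltas, lo, hi, tol=1e-10):
    """Pearson statistic for exponential fit from censored data.

    observed : the first [n * alpha] order statistics, in increasing order
    n        : the full sample size, including unobserved lifetimes
    deltas   : 0 = delta_0 < ... < delta_{M-1} = alpha < delta_M = 1
    lo, hi   : bracket assumed to straddle the root of the likelihood equation
    Returns (theta estimate, cell counts, Pearson statistic).
    """
    # [n * delta_i]; beware of floating-point error when n * delta_i is
    # meant to be an integer but is not exactly representable.
    ranks = [int(math.floor(n * dlt)) for dlt in deltas]
    # xi_0 = 0, xi_i = X_([n delta_i]) for 0 < i < M, xi_M = infinity.
    xi = [0.0] + [observed[r - 1] for r in ranks[1:-1]] + [math.inf]
    counts = [ranks[i + 1] - ranks[i] for i in range(len(deltas) - 1)]

    def probs(theta):
        surv = [0.0 if math.isinf(a) else math.exp(-a / theta) for a in xi]
        return [surv[i] - surv[i + 1] for i in range(len(counts))]

    def score(theta):
        g = [0.0 if math.isinf(a) else a * math.exp(-a / theta) for a in xi]
        p = probs(theta)
        return sum(k * (g[i] - g[i + 1]) / p[i] for i, k in enumerate(counts))

    if not (score(lo) > 0 > score(hi)):
        raise ValueError("bracket [lo, hi] does not straddle a root")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    stat = sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probs(theta)))
    return theta, counts, stat
```

Only the order statistics at ranks $[n\delta_1],\ldots,[n\delta_{M-1}]$ enter the computation, which is why the unobserved lifetimes cause no difficulty.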